Machine learning (ML), data science (DS) and artificial intelligence (AI) are closely related disciplines.
A down-to-earth definition of machine learning
A taxonomy of machine learning problems
Traditional machine learning vs. deep neural networks
Machine learning methods can be broadly categorized into two groups: traditional machine learning and deep neural networks. Traditional machine learning includes linear regression, logistic regression, support vector machines, decision trees, random forests, and gradient boosting. These methods are known for their simplicity, interpretability, and computational efficiency.
On the other hand, deep neural networks are composed of multiple layers of artificial neurons that are trained to learn complex representations of the data. They include convolutional neural networks, recurrent neural networks, generative adversarial networks, and transformers. These models can capture complex patterns in the data. However, deep neural networks are computationally more intensive to train and less interpretable than traditional machine learning methods, and they require larger amounts of training data to achieve good performance.
In this semester's Machine Learning course, we will concentrate exclusively on traditional machine learning techniques. For those seeking further knowledge on deep neural networks, the Neural Networks course offered in the following semester may be of interest.
Some machine learning terms
Linear regression is a simple but still useful learning algorithm that dates back to the 19th century. The underlying method of least squares was published by Legendre and Gauss in the early 1800s; Galton introduced the notion of "regression" in the 1880s, and the modern statistical treatment was developed by Fisher in the early 20th century. Despite its age, linear regression remains popular in numerous fields, with implementations constantly evolving in response to technological advancements.
Exercise 1: Perform univariate linear regression on the MLB baseball player data in the baseball.txt file, using height as the input and weight as the target!
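Before coding, recall where the training formula comes from (standard least squares; after centering both variables, the fitted line passes through the origin, so no intercept is needed):

```latex
E(w) = \sum_{i=1}^{n} (w x_i - y_i)^2, \qquad
\frac{dE}{dw} = 2 \sum_{i=1}^{n} x_i (w x_i - y_i) = 0
\quad\Longrightarrow\quad
w = \frac{\sum_{i} x_i y_i}{\sum_{i} x_i^2} = \frac{x^\top y}{x^\top x}
```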
# Load data.
import numpy as np
data = np.loadtxt('../_data/baseball.txt', delimiter=',')
data
array([[188., 82.],
       [188., 98.],
       [183., 95.],
       ...,
       [190., 93.],
       [190., 86.],
       [185., 88.]])
data.shape
(1033, 2)
# Extract input vector (height) and target vector (weight).
x = data[:, 0]
y = data[:, 1]
# Subtract mean.
xm = x.mean()
ym = y.mean()
x -= xm
y -= ym
# Prepare a scatter plot of the data.
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 8))
plt.scatter(x, y, alpha=0.7)
<matplotlib.collections.PathCollection at 0x7f6508d96280>
# Train model (compute the optimal w value).
w = (x @ y) / (x @ x)
w
0.8561919786085516
# Predict the weight of a player who is 190 cm tall.
(190 - xm) * w + ym
94.01062994703861
# Draw regression line into the scatter plot.
plt.figure(figsize=(8, 8))
plt.scatter(x, y, alpha=0.7)
x2 = np.array([x.min(), x.max()])
plt.plot(x2, x2 * w, color='red')
[<matplotlib.lines.Line2D at 0x7f6508d5f280>]
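As a quick cross-check (not part of the exercise; the data points below are made up), the closed-form slope can be compared against `np.polyfit`, which fits the same least-squares line with an intercept; on centered data the intercept comes out as zero:

```python
import numpy as np

# Hypothetical centered data (both x and y have zero mean).
x = np.array([-5.0, -2.0, 0.0, 3.0, 4.0])
y = np.array([-4.0, -1.5, 0.5, 1.5, 3.5])

w = (x @ y) / (x @ x)                       # closed-form slope, no intercept
slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line with intercept
print(w, slope, intercept)  # the two slopes agree; the intercept is ~0
```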
Multivariate linear regression extends univariate linear regression by adding further input features to the model. Its equation is an extension of the univariate one, with one coefficient per input feature representing the strength and direction of the relationship between that feature and the target.
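In matrix form this is ordinary least squares: with input matrix $X \in \mathbb{R}^{n \times d}$, parameter vector $w$, and target vector $y$, minimizing the squared error $\|Xw - y\|^2$ and setting the gradient to zero yields the normal equations (assuming $X^\top X$ is invertible):

```latex
\nabla_w \|Xw - y\|^2 = 2 X^\top (X w - y) = 0
\quad\Longrightarrow\quad
X^\top X \, w = X^\top y
\quad\Longrightarrow\quad
w = (X^\top X)^{-1} X^\top y
```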
import numpy as np
def gen_data(n, d):
    rs = np.random.RandomState(42)
    X = rs.random(size=(n, d))  # input matrix
    v = np.arange(d)            # v = [0, 1, 2, ...]
    y = X @ v + rs.randn(n)     # target vector
    return X, y
# Generate artificial data set.
X, y = gen_data(1000, 50)
X
array([[0.37454012, 0.95071431, 0.73199394, ..., 0.52006802, 0.54671028,
        0.18485446],
       [0.96958463, 0.77513282, 0.93949894, ..., 0.42754102, 0.02541913,
        0.10789143],
       [0.03142919, 0.63641041, 0.31435598, ..., 0.50267902, 0.05147875,
        0.27864646],
       ...,
       [0.59548243, 0.53848475, 0.04274819, ..., 0.5590887 , 0.3681002 ,
        0.92417205],
       [0.88564332, 0.6376487 , 0.76785135, ..., 0.24720585, 0.3965254 ,
        0.98608331],
       [0.34848823, 0.70796571, 0.25040201, ..., 0.74899508, 0.52101091,
        0.86170671]])
y
array([533.60298728, 576.13366431, 568.12812277, 668.4824252 ,
       699.66909766, 589.5267887 , 531.56261332, 683.01173847,
       ...,
       627.69084159, 530.99909756, 597.38904177, 618.10516116])
X.shape
(1000, 50)
%%time
# Compute optimal parameter vector.
w = np.linalg.inv(X.T @ X) @ (X.T @ y)
w
CPU times: user 5.76 ms, sys: 336 µs, total: 6.09 ms
Wall time: 5.65 ms
array([-6.27668535e-03, 7.79377042e-01, 2.12007621e+00, 2.91175236e+00,
3.87036693e+00, 5.10677902e+00, 5.99889354e+00, 6.96317023e+00,
7.85544458e+00, 9.14518616e+00, 9.91401623e+00, 1.10795175e+01,
1.18655076e+01, 1.29230253e+01, 1.39493048e+01, 1.50584688e+01,
1.59630538e+01, 1.70070564e+01, 1.78704537e+01, 1.91984375e+01,
1.98991713e+01, 2.09711802e+01, 2.20835181e+01, 2.31399674e+01,
2.40855004e+01, 2.49690066e+01, 2.61665348e+01, 2.69094644e+01,
2.80304559e+01, 2.89075191e+01, 3.00075384e+01, 3.10977986e+01,
3.18331446e+01, 3.32259644e+01, 3.40089749e+01, 3.53181296e+01,
3.59508608e+01, 3.69689455e+01, 3.76996546e+01, 3.89252973e+01,
4.02508652e+01, 4.11066389e+01, 4.19419945e+01, 4.29471376e+01,
4.39179113e+01, 4.51380313e+01, 4.60397828e+01, 4.69299566e+01,
4.78870045e+01, 4.91419825e+01])
%%time
# ...more efficient way
np.linalg.solve(X.T @ X, X.T @ y)
CPU times: user 2.96 ms, sys: 657 µs, total: 3.62 ms
Wall time: 3.01 ms
array([-6.27668535e-03, 7.79377042e-01, 2.12007621e+00, 2.91175236e+00,
3.87036693e+00, 5.10677902e+00, 5.99889354e+00, 6.96317023e+00,
7.85544458e+00, 9.14518616e+00, 9.91401623e+00, 1.10795175e+01,
1.18655076e+01, 1.29230253e+01, 1.39493048e+01, 1.50584688e+01,
1.59630538e+01, 1.70070564e+01, 1.78704537e+01, 1.91984375e+01,
1.98991713e+01, 2.09711802e+01, 2.20835181e+01, 2.31399674e+01,
2.40855004e+01, 2.49690066e+01, 2.61665348e+01, 2.69094644e+01,
2.80304559e+01, 2.89075191e+01, 3.00075384e+01, 3.10977986e+01,
3.18331446e+01, 3.32259644e+01, 3.40089749e+01, 3.53181296e+01,
3.59508608e+01, 3.69689455e+01, 3.76996546e+01, 3.89252973e+01,
4.02508652e+01, 4.11066389e+01, 4.19419945e+01, 4.29471376e+01,
4.39179113e+01, 4.51380313e+01, 4.60397828e+01, 4.69299566e+01,
4.78870045e+01, 4.91419825e+01])
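A numerical side note: forming $X^\top X$ squares the condition number of the problem, so when features are nearly collinear both `inv` and `solve` on the normal equations can lose precision. `np.linalg.lstsq` factorizes $X$ directly (via SVD) and is the usual robust alternative. A minimal sketch on a tiny made-up system:

```python
import numpy as np

# Hypothetical noiseless system: y lies exactly in the column space of X.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w_true = np.array([2.0, -1.0])
y = X @ w_true

# lstsq works on X itself instead of the normal equations.
w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(w)  # recovers w_true up to floating-point error
```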
Exercise 3: Plot the time complexity of training the model!
def fit_lin_reg(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)
import time
import pandas as pd
# How the running time grows with n?
data = []
for n in [5000, 10000, 15000, 20000, 25000, 30000, 35000]:
    print(n)
    X, y = gen_data(n, 50)
    t0 = time.time()
    fit_lin_reg(X, y)
    t1 = time.time()
    data.append({
        'n': n,
        'time': t1 - t0
    })
pd.DataFrame(data).set_index('n').plot()
5000 10000 15000 20000 25000 30000 35000
<AxesSubplot: xlabel='n'>
# How the running time grows with d?
data = []
for d in [50, 100, 200, 500, 1000, 1500, 2000]:
    print(d)
    X, y = gen_data(20000, d)
    t0 = time.time()
    fit_lin_reg(X, y)
    t1 = time.time()
    data.append({
        'd': d,
        'time': t1 - t0
    })
pd.DataFrame(data).set_index('d').plot()
50 100 200 500 1000 1500 2000
<AxesSubplot: xlabel='d'>
To include a bias term in $d$-variate linear regression, one can append an all-ones column to the input matrix, making it $X'=[X\ |\ \mathbf{1}]$, and append a bias parameter to the parameter vector, $w'=[w\ |\ \beta]$. The resulting $(d+1)$-variate model is equivalent to a $d$-variate model with a bias term, because $X'w' = Xw + \beta\mathbf{1}$.
# Incorporate a bias term into the previous model.
X, y = gen_data(1000, 50)
o = np.ones((1000, 1))
X_pr = np.hstack([X, o])
w_pr = fit_lin_reg(X_pr, y)
w, beta = w_pr[:50], w_pr[50]
w
array([-3.06401464e-03, 7.82222930e-01, 2.12108232e+00, 2.91386326e+00,
3.87228420e+00, 5.11026964e+00, 6.00067559e+00, 6.96542866e+00,
7.85673963e+00, 9.14797989e+00, 9.91686530e+00, 1.10816627e+01,
1.18689718e+01, 1.29260957e+01, 1.39518797e+01, 1.50615225e+01,
1.59648657e+01, 1.70097186e+01, 1.78728642e+01, 1.92008891e+01,
1.99009444e+01, 2.09740852e+01, 2.20863362e+01, 2.31414739e+01,
2.40883621e+01, 2.49715111e+01, 2.61689865e+01, 2.69129379e+01,
2.80332758e+01, 2.89101312e+01, 3.00091476e+01, 3.11004170e+01,
3.18356050e+01, 3.32282182e+01, 3.40110034e+01, 3.53203475e+01,
3.59551113e+01, 3.69712921e+01, 3.77020377e+01, 3.89276599e+01,
4.02537764e+01, 4.11091654e+01, 4.19436299e+01, 4.29507093e+01,
4.39199257e+01, 4.51407203e+01, 4.60427084e+01, 4.69329446e+01,
4.78878303e+01, 4.91438904e+01])
beta # bias (or intercept) term
-0.06197959478839162
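To confirm that the all-ones column really acts as a bias, here is a small self-contained check on hypothetical noiseless data (so the recovery is exact up to floating point):

```python
import numpy as np

rs = np.random.RandomState(0)
X = rs.random(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
b_true = 4.0
y = X @ w_true + b_true  # noiseless targets with a known bias

# Append the all-ones column and solve the normal equations.
X_pr = np.hstack([X, np.ones((200, 1))])
w_pr = np.linalg.solve(X_pr.T @ X_pr, X_pr.T @ y)
w, beta = w_pr[:3], w_pr[3]
print(w, beta)  # w_true and b_true are recovered
```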
When the input matrix is sparse, meaning most of its entries are zero, the dense representation wastes memory; the matrix should instead be stored in a sparse format such as COO, CSR, or CSC. To exploit the sparsity during training, algorithms should be employed that only touch the non-zero entries of the input matrix.
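As a sketch of both points (the density value and solver tolerances below are illustrative choices, not prescribed by the course): CSR stores only the non-zero entries plus index arrays, and an iterative method such as `scipy.sparse.linalg.lsqr` fits the model using only the stored non-zeros, without ever densifying the matrix:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

n, d = 1000, 100
X = sp.random(n, d, density=0.05, random_state=42).tocsr()
v = np.arange(d)
y = X @ v  # noiseless sparse regression problem

# Memory: CSR keeps non-zeros + index arrays, vs. n*d*8 bytes when dense.
csr_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = n * d * 8
print(csr_bytes, dense_bytes)

# lsqr touches only the stored non-zeros in each iteration.
w = lsqr(X, y, atol=1e-10, btol=1e-10, iter_lim=1000)[0]
print(w[:5])
```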
import numpy as np
import scipy.sparse as sp
def gen_sparse_data(n, d, density):
    X = sp.random(n, d, density=density, random_state=42)   # input matrix
    v = np.arange(d)
    y = X @ v + np.random.RandomState(42).randn(n)          # target vector
    return X, y
# Generate artificial data set.
X, y = gen_sparse_data(1000, 100, 0.01)
X
<1000x100 sparse matrix of type '<class 'numpy.float64'>' with 1000 stored elements in COOrdinate format>
y
array([ 8.31573963e+01, -1.38264301e-01, 6.47688538e-01, 6.95470361e+01,
        8.64440171e+01, 2.21265946e+01, 3.92455292e+01, 7.67434729e-01,
        ...])
-1.11057585e+00, 3.94740357e+01, 9.35678393e-01, 4.98835954e+01,
7.21672064e-01, -1.12905177e+00, -5.24520266e-01, 4.89374561e-01,
7.33921370e+00, 4.18364538e+01, -2.40325398e-01, 3.29684501e+01,
7.31587477e+01, 4.44263311e-01, -3.60966166e-01, 1.04274169e+02,
-1.08106333e+00, 6.15935607e-01, 3.52020291e+01, 6.39846908e+01,
3.26133022e-01, -1.25111358e+00, 5.74452939e+01, 3.65038363e+01,
-5.22723021e-01, 1.98193555e+01, 2.14581715e+02, -1.40846130e+00,
-1.55662917e+00, 6.01959038e+01, -1.28042935e+00, 1.75479418e+00,
3.99319672e+01, 1.04210976e+02, 2.11017467e-01, -9.67131119e-02,
4.15817490e+00, 3.99136114e-01, -3.76347024e-02, 1.10330188e+00,
1.14227649e-01, 1.50301761e-01, 6.31935479e+01, 1.34684251e+01,
5.87727582e+01, 5.93318059e+01, -1.34818542e+00, 7.43264094e-01,
1.26272946e+01, 3.80599387e-01, 1.84339331e-02, 5.67971095e-01,
-2.75828890e-01, 6.52733524e+01, 1.52913838e+01, 1.01548888e+02,
4.08252756e-01, 9.20331091e+01, 1.02915564e+00, 1.01133914e+01,
2.98162802e+01, 9.82690984e-01, 1.66547444e+00, 1.82581408e+01,
-1.84087423e+00, -1.27957697e+00, 1.09184425e+00, 8.09158300e+01,
5.17659020e-01, 1.60650632e+01, 1.86766764e-01, -7.55382932e-01,
9.02787127e+00, -1.40666110e+00, 3.81134789e+01, -1.35168461e+00,
-9.75873253e-01, 5.37903456e+01, -9.49398889e-01, 2.73826778e+01,
4.93317901e-01, 4.27184385e+00, -8.58357780e-01, 7.00309879e-01,
-5.75637826e-01, 3.91745640e+01, 2.56008454e+00, -9.60598997e-02,
1.14927333e+00, -7.03176425e-01, 7.60889028e+01, 1.13391785e+01,
-6.26967058e-01, 1.09636746e+02, 3.69662321e+01, -5.62466776e-01,
3.35971777e+01, 4.23290339e+01, 6.21809962e-01, 1.34528632e+01,
2.77490114e+01, -2.47518636e-01, 8.67392650e+00, 6.20672098e-01,
2.05359534e+02, 9.19366707e+00, 3.80197851e-01, 1.04310509e+01,
8.11139396e+00, 1.08078073e+00, 2.49390731e+01, 1.85169074e+01,
3.35011023e+01, -6.14828580e-02, 8.43396125e+01, 1.70930427e+01,
2.73798311e+01, -1.27674858e+00, 7.77179029e+01, 1.13890736e+02,
1.28881450e+01, 2.75019871e+01, 3.54570631e+01, 4.56850498e+01,
9.38283806e-01, 4.18617249e+01, 9.61207769e-02, 5.88341198e+01,
-4.34496227e-01, 1.74726774e+01, 2.22133772e-01, 1.11095624e+02,
6.38660478e+01, 8.02964325e+01, 3.25842082e+01, 1.91862309e+01,
1.44697788e+00, 1.96554777e-01, 3.70896444e+01, 1.01181811e+01,
2.67050266e-01, 8.44466714e+00, 3.02996058e+01, 1.06548038e+00,
-5.17288450e-01, 4.19938171e+00, 2.29889812e+00, 9.64601871e+01,
1.99688389e+01, 4.35057524e+01, 1.57957215e+00, -5.22860027e-01,
2.31667894e+01, -2.81784609e-01, 4.91509552e+01, -9.18651946e-01,
5.84600305e+01, -7.67797565e-01, 4.86931946e+01, 2.34214733e-01,
5.17423260e+01, 1.79986887e+01, 5.41889089e+01, 3.62268791e+01,
3.89907169e+01, 1.13376845e+02, 7.30223401e+00, 6.10600762e+01,
5.68090106e+01, 3.00041252e+01, 2.89168644e-01, 2.45530014e+00,
-6.37739984e-01, 3.63630036e+01, -6.23140526e-01, 4.60559502e+00,
8.82810085e+01, 1.18901653e+00, 1.42050425e+00, 3.46602959e-01,
-8.32355573e-01, 2.91068583e+01, 2.07953860e+01, 6.32931818e-01,
5.67307179e+01, 2.60236926e+01, 5.91512652e+01, 9.37678393e+00,
7.05373709e+01, 1.04075165e+02, 5.46855779e+01, 3.34456790e-01,
7.50952173e+00, 9.81238915e+00, -1.76947227e-01, -7.98297245e-01,
-1.37931923e+00, 9.05900531e+00, 2.36846302e+01, 1.79455786e+00,
-5.17611299e-01, 2.23787952e-01, 2.27722121e+01, 3.38545236e+01,
2.52693243e+00, -5.30868773e-01, -4.89439443e-01, 4.10845235e+01,
1.38364420e+02, 3.33142153e+00, 5.83928185e-01, 1.37773186e+01,
8.39384992e+01, 4.97237317e+01, 3.06129262e+01, 5.07274031e-01,
1.01233066e+02, 6.74484791e+00, 8.16545686e+01, 3.57240104e+01,
3.07790020e+00, 1.46713686e-01, 1.20650897e+00, 6.23745591e+00,
3.68673309e-01, -3.93338812e-01, 9.98390819e+00, 1.27845186e+00,
1.10046846e+01, 2.77530573e+01, 6.89558569e+01, 7.46253566e-01,
1.31758718e+01, 3.27233434e+01, -3.07778235e-01, 1.29516575e+01,
5.88581157e+01, 1.57745328e+00, 8.09891127e-01, 2.79021526e-01,
6.07896510e-01, 1.16786284e+02, 2.16282543e+01, 6.61882507e+00,
1.07363175e+00, 4.56163361e+01, 5.84334713e+01, -7.00120815e-01,
6.50238090e+01, 4.04713899e+01, -5.58921847e-01, 3.77211875e-01,
4.70644844e+01, 1.03148048e+02, 1.93713395e+00, 1.02059968e+01,
-1.44801390e+00, 3.16304991e+00, 1.85016890e+01, 1.37667145e+01,
1.14316375e+02, 4.61608040e+00, 1.07219611e+00, -5.64078631e-01,
5.70811884e+01, 3.52930388e+01, 1.33897421e+02, -9.91758638e-02,
8.32627671e+01, -1.07008477e+00, -1.52552517e+00, -6.91908070e-01,
3.24545253e+01, 3.26034024e+01, 5.47388350e+00, 3.52055397e-01,
6.97386061e+01, 6.90124040e+01, -8.21511784e-02, 7.03871380e+01,
3.42725346e-01, 2.84477548e+01, 5.25396229e+00, 1.10229165e+02,
2.78871522e+01, 5.16295307e+01, 2.35367064e+01, 5.12328476e+01,
6.26622340e+01, 1.44011722e+00, 5.79370306e+01, 1.80094043e+00,
3.94910677e+00, 2.34062385e+01, 5.81697112e+00, -6.81051657e-01,
1.16159658e+01, 2.00766350e+01, -4.46183433e-01, 4.61046771e+01,
1.00084701e+02, -2.42387933e+00, 3.54931376e+01, 7.60414656e-01,
8.20522539e-01, 3.38284438e+01, -9.66976143e-01, 1.39267160e+01,
1.29677682e+01, -1.15836469e+00, 4.08262399e+01, 5.98465277e+00,
1.43087165e+01, 2.68858390e-02, 2.71793446e+01, 4.22579496e+01,
1.22059505e+02, -6.81984248e-01, 9.39162928e+00, 6.40787645e+00,
1.39675424e+01, 6.40842861e-01, 5.87374343e+01, 5.72582781e-01])
X.nnz / (X.shape[0] * X.shape[1])
0.01
# Compute optimal parameter vector.
n, d = X.shape
A = sp.linalg.LinearOperator((d, d), matvec=lambda v: X.T @ (X @ v))
b = X.T @ ys
w = sp.linalg.cg(A, b)[0]
w
array([4.64339975e-02, 3.26604917e-01, 2.29059762e+00, 2.97799922e+00,
4.43162378e+00, 4.36446042e+00, 6.21333225e+00, 6.55339954e+00,
7.05355659e+00, 1.03196805e+01, 1.10572575e+01, 1.15858278e+01,
1.17330927e+01, 1.33788421e+01, 1.37388481e+01, 1.54351381e+01,
1.57893995e+01, 1.64644072e+01, 1.72860014e+01, 1.92960255e+01,
2.07635357e+01, 2.11645421e+01, 2.22330136e+01, 2.42350792e+01,
2.30280771e+01, 2.53374573e+01, 2.57806693e+01, 2.71485673e+01,
2.78002859e+01, 2.84697530e+01, 3.10263660e+01, 3.07551333e+01,
3.24995461e+01, 3.37672839e+01, 3.38745444e+01, 3.50316613e+01,
3.63488936e+01, 3.82325952e+01, 3.76995083e+01, 3.81634254e+01,
3.99686573e+01, 4.14135827e+01, 4.15741918e+01, 4.23832713e+01,
4.33668312e+01, 4.48930318e+01, 4.77707684e+01, 4.46614983e+01,
4.73798992e+01, 4.90847053e+01, 4.98237101e+01, 5.18169738e+01,
5.11428677e+01, 5.30883577e+01, 5.32198253e+01, 5.48964684e+01,
5.54724743e+01, 5.81113882e+01, 5.78044648e+01, 5.82857037e+01,
6.10110395e+01, 6.17879142e+01, 6.14003995e+01, 6.27920372e+01,
6.41256972e+01, 6.48996134e+01, 6.45764947e+01, 6.74807125e+01,
6.83645753e+01, 6.91000515e+01, 7.06680225e+01, 7.14375467e+01,
7.14460225e+01, 7.27029937e+01, 7.41242247e+01, 7.44094776e+01,
7.61913445e+01, 7.68914070e+01, 7.76538667e+01, 7.96193506e+01,
7.97402505e+01, 8.03906231e+01, 8.26255295e+01, 8.37349515e+01,
8.36416654e+01, 8.50963847e+01, 8.59888530e+01, 8.69106689e+01,
8.80390935e+01, 8.92648847e+01, 9.14780571e+01, 9.08513364e+01,
9.08726437e+01, 9.32493484e+01, 9.45125753e+01, 9.46143945e+01,
9.62206304e+01, 9.69323206e+01, 9.84048975e+01, 1.00099921e+02])
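As a sanity check for the matrix-free CG approach used above, the solution can be compared against a direct dense solve of the normal equations on a small synthetic problem (the random data and variable names here are illustrative only, not part of the exercise):

```python
import numpy as np
import scipy.sparse.linalg as spla

rng = np.random.RandomState(0)
Xs = rng.rand(50, 5)   # small synthetic input matrix
ys2 = rng.rand(50)     # synthetic target vector

# Matrix-free operator for X.T @ X, as in the cell above.
A = spla.LinearOperator((5, 5), matvec=lambda v: Xs.T @ (Xs @ v))
w_cg = spla.cg(A, Xs.T @ ys2)[0]

# Direct solve of the normal equations for comparison.
w_direct = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys2)
print(np.allclose(w_cg, w_direct, atol=1e-3))
```

The advantage of the operator form is that X.T @ X is never materialized, which matters when X is large and sparse.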
The Boston Housing dataset was originally published by Harrison and Rubinfeld in 1978, in a paper titled Hedonic prices and the demand for clean air. The authors used the dataset to investigate the relationship between air quality and housing prices in the Boston metropolitan area. Since then, the dataset has been widely used in machine learning and statistics research as a benchmark for regression tasks.
Each row in the data table contains various attributes of a district in Boston. The last attribute is the median value of owner-occupied homes in the district. We will create regressors that estimate this attribute from the other attributes of the district.
Exercise 1: Load the Boston Housing data set to DataFrame and display basic statistics about it!
# Extract column names.
fname = '../_data/housing_names.txt'
columns = []
for line in open(fname):
    if line.startswith(' ') and line[4].isdigit():
        columns.append(line.split()[1])
columns
['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT', 'MEDV']
# Load to DataFrame.
import pandas as pd
df = pd.read_csv('../_data/housing_data.txt', sep='\t', names=columns)
df
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273.0 | 21.0 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273.0 | 21.0 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273.0 | 21.0 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273.0 | 21.0 | 6.48 | 22.0 |
| 505 | 0.04741 | 0.0 | 11.93 | 0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1 | 273.0 | 21.0 | 7.88 | 11.9 |
506 rows × 13 columns
# Check column data types.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  LSTAT    506 non-null    float64
 12  MEDV     506 non-null    float64
dtypes: float64(11), int64(2)
memory usage: 51.5 KB
# Basic column statistics.
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CRIM | 506.0 | 3.613524 | 8.601545 | 0.00632 | 0.082045 | 0.25651 | 3.677083 | 88.9762 |
| ZN | 506.0 | 11.363636 | 23.322453 | 0.00000 | 0.000000 | 0.00000 | 12.500000 | 100.0000 |
| INDUS | 506.0 | 11.136779 | 6.860353 | 0.46000 | 5.190000 | 9.69000 | 18.100000 | 27.7400 |
| CHAS | 506.0 | 0.069170 | 0.253994 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 1.0000 |
| NOX | 506.0 | 0.554695 | 0.115878 | 0.38500 | 0.449000 | 0.53800 | 0.624000 | 0.8710 |
| RM | 506.0 | 6.284634 | 0.702617 | 3.56100 | 5.885500 | 6.20850 | 6.623500 | 8.7800 |
| AGE | 506.0 | 68.574901 | 28.148861 | 2.90000 | 45.025000 | 77.50000 | 94.075000 | 100.0000 |
| DIS | 506.0 | 3.795043 | 2.105710 | 1.12960 | 2.100175 | 3.20745 | 5.188425 | 12.1265 |
| RAD | 506.0 | 9.549407 | 8.707259 | 1.00000 | 4.000000 | 5.00000 | 24.000000 | 24.0000 |
| TAX | 506.0 | 408.237154 | 168.537116 | 187.00000 | 279.000000 | 330.00000 | 666.000000 | 711.0000 |
| PTRATIO | 506.0 | 18.455534 | 2.164946 | 12.60000 | 17.400000 | 19.05000 | 20.200000 | 22.0000 |
| LSTAT | 506.0 | 12.653063 | 7.141062 | 1.73000 | 6.950000 | 11.36000 | 16.955000 | 37.9700 |
| MEDV | 506.0 | 22.532806 | 9.197104 | 5.00000 | 17.025000 | 21.20000 | 25.000000 | 50.0000 |
Exercise 2: Build a univariate linear model for each column and measure its RMSE (root mean squared error)! Use the full data set both for model building and error measurement!
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i)^2}$$
y = df['MEDV'].values
y = y - y.mean()
for column in columns[:-1]:
    x = df[column].values
    x = x - x.mean()
    w = (x @ y) / (x @ x)  # optimal model parameter
    yhat = x * w  # prediction
    rmse = ((yhat - y)**2).mean()**0.5  # root mean squared error
    print(column, rmse)
CRIM 8.467038200100824
ZN 8.570396495772854
INDUS 8.04153105080589
CHAS 9.045800910882107
NOX 8.306881987569504
RM 6.603071389222562
AGE 8.510228018625197
DIS 8.896422965780745
RAD 8.492632800301259
TAX 8.117097716353989
PTRATIO 7.915314271320455
LSTAT 6.20346413142642
Exercise 3: Build a multivariate linear model and measure its RMSE! Use the full data set both for model building and error measurement!
import numpy as np
X = df[columns[:-1]].values
y = df['MEDV'].values
w = np.linalg.solve(X.T @ X, X.T @ y) # optimal parameter vector
yhat = X @ w # prediction
rmse = ((yhat - y)**2).mean()**0.5 # root mean squared error
rmse
5.065952606279903
Exercise 4: Introduce a bias term into the multivariate linear model!
X2 = np.hstack([X, np.ones((X.shape[0], 1))])
y = df['MEDV'].values
w = np.linalg.solve(X2.T @ X2, X2.T @ y) # optimal parameter vector
yhat = X2 @ w # prediction
rmse = ((yhat - y)**2).mean()**0.5 # root mean squared error
rmse
4.735998462783738
# Display the weights associated with the features.
pd.Series(w[:-1], columns[:-1])
CRIM       -0.121389
ZN          0.046963
INDUS       0.013468
CHAS        2.839993
NOX       -18.758022
RM          3.658119
AGE         0.003611
DIS        -1.490754
RAD         0.289405
TAX        -0.012682
PTRATIO    -0.937533
LSTAT      -0.552019
dtype: float64
# Display the weights associated with the features, after scaling the columns.
X3 = np.hstack([X / X.std(axis=0), np.ones((X.shape[0], 1))])
y = df['MEDV'].values
w = np.linalg.solve(X3.T @ X3, X3.T @ y) # optimal parameter vector
pd.Series(w[:-1], columns[:-1])
CRIM      -1.043097
ZN         1.094220
INDUS      0.092302
CHAS       0.720628
NOX       -2.171487
RM         2.567716
AGE        0.101537
DIS       -3.135992
RAD        2.517429
TAX       -2.135271
PTRATIO   -2.027701
LSTAT     -3.938105
dtype: float64
Exercise 5: Repeat the previous experiment so that the model is built on a training set and evaluated on a distinct test set!
n = df.shape[0]
rs = np.random.RandomState(42)
idxs = rs.permutation(n)
s = int(n * 0.7)
tr = idxs[:s] # indices of the training set
te = idxs[s:] # indices of the test set
len(tr), len(te)
(354, 152)
X2 = np.hstack([X, np.ones((X.shape[0], 1))])
y = df['MEDV'].values
w = np.linalg.solve(X2[tr].T @ X2[tr], X2[tr].T @ y[tr]) # optimal parameter vector: use training set only!
yhat = X2 @ w # prediction
rmse_te = ((yhat[te] - y[te])**2).mean()**0.5 # root mean squared error: use test set only!
rmse_te
4.82601778299211
Scikit-learn is a popular open-source machine learning library in Python. It provides a range of supervised and unsupervised learning algorithms for tasks such as classification, regression, clustering, and dimensionality reduction. Scikit-learn also includes tools for model selection, preprocessing, and evaluation, making it a comprehensive library for building and evaluating machine learning models.
In sklearn, every model has a fit() and a predict() method.
- The fit() method requires 2 parameters: the input matrix $X$ and a target vector $y$. Calling the fit() method trains a model on the given data.
- The predict() method requires an input matrix $X$ and returns the prediction of the trained model for the given inputs.

Exercise 6: Repeat the previous experiments using scikit-learn!
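A minimal sketch of the fit()/predict() workflow on tiny synthetic data (the values and variable names are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_demo = np.array([[1.0], [2.0], [3.0]])  # input matrix: 3 samples, 1 feature
y_demo = np.array([2.0, 4.0, 6.0])        # target vector

model = LinearRegression()
model.fit(X_demo, y_demo)        # train the model on (X, y)
pred = model.predict(X_demo)     # predict for the given inputs
print(np.allclose(pred, y_demo)) # the toy data is exactly linear
```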
# Query the version number of scikit-learn.
import sklearn
sklearn.__version__
'1.2.1'
# This is how we could create a train-test split with scikit-learn.
# However, we will keep using the original split to make the results comparable.
from sklearn.model_selection import ShuffleSplit
tr2, te2 = next(ShuffleSplit(random_state=42, test_size=0.3).split(X))
# In scikit-learn, the closest thing to RMSE is mean_squared_error.
from sklearn.metrics import mean_squared_error
# Univariate models, no train-test split.
from sklearn.linear_model import LinearRegression
y = df['MEDV'].values
for column in columns[:-1]:
    x = df[[column]].values
    re = LinearRegression()
    re.fit(x, y)
    yhat = re.predict(x)
    rmse = mean_squared_error(y, yhat)**0.5
    print(column, rmse)
CRIM 8.467038200100824
ZN 8.570396495772854
INDUS 8.04153105080589
CHAS 9.045800910882107
NOX 8.306881987569504
RM 6.603071389222561
AGE 8.510228018625199
DIS 8.896422965780747
RAD 8.492632800301259
TAX 8.117097716353987
PTRATIO 7.915314271320455
LSTAT 6.20346413142642
# Multivariate model without bias, no train-test split.
X = df[columns[:-1]].values
y = df['MEDV'].values
re = LinearRegression(fit_intercept=False)
re.fit(X, y)
yhat = re.predict(X)
rmse = mean_squared_error(y, yhat)**0.5
print(rmse)
5.065952606279903
# Multivariate model with bias, using the train-test split.
X = df[columns[:-1]].values
y = df['MEDV'].values
re = LinearRegression()
re.fit(X[tr], y[tr]) # use training set
yhat = re.predict(X)
rmse = mean_squared_error(y[te], yhat[te])**0.5 # use test set
print(rmse)
4.826017782992097
Logistic regression is an algorithm for classification.
It builds a linear model that separates the two classes by a hyperplane.
For simplicity, let's first discuss the univariate case, with a binary target variable and no bias term.
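The training procedure used in the exercise below is Newton's method applied to the cross-entropy loss. For reference, the quantities computed in each iteration are (with $\sigma$ denoting the sigmoid function):

$$\hat{y}_i = \sigma(w x_i), \qquad \sigma(t) = \frac{1}{1 + e^{-t}}$$

$$\mathrm{CE}(w) = -\sum_{i=1}^n \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

$$\mathrm{CE}'(w) = \sum_{i=1}^n (\hat{y}_i - y_i)\, x_i, \qquad \mathrm{CE}''(w) = \sum_{i=1}^n \hat{y}_i (1 - \hat{y}_i)\, x_i^2$$

The Newton step is then $w \leftarrow w - \mathrm{CE}'(w) / \mathrm{CE}''(w)$.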
import pandas as pd
data = [
    {'name': 'David Beckham', 'study_time': 0, 'result': 0},
    {'name': 'Jessica Scott', 'study_time': 7, 'result': 1},
    {'name': 'Jack Johnson', 'study_time': 3.5, 'result': 0},
    {'name': 'Scunner Campbell', 'study_time': 6, 'result': 0},
    {'name': 'Plain Jane ', 'study_time': 3, 'result': 1},
    {'name': 'Archie Gillis', 'study_time': 15, 'result': 1},
]
df = pd.DataFrame(data)
df
| name | study_time | result | |
|---|---|---|---|
| 0 | David Beckham | 0.0 | 0 |
| 1 | Jessica Scott | 7.0 | 1 |
| 2 | Jack Johnson | 3.5 | 0 |
| 3 | Scunner Campbell | 6.0 | 0 |
| 4 | Plain Jane | 3.0 | 1 |
| 5 | Archie Gillis | 15.0 | 1 |
The above toy data set contains 2 attributes of 6 students: the time spent studying for an exam (study_time, in hours) and whether the student passed the exam (result, 1 = pass, 0 = fail).
Exercise 1: Train a univariate logistic regression model that estimates the result column from the study_time column!
x = df['study_time'].values # input vector
y = df['result'].values # target vector
# subtract mean from input
xm = x.mean()
x -= xm
print(x, y)
[-5.75 1.25 -2.25 0.25 -2.75 9.25] [0 1 0 0 1 1]
import numpy as np
def sigmoid(t):
    return 1 / (1 + np.exp(-t))
w = 0  # initial model parameter
for it in range(10):
    yhat = sigmoid(x * w)
    ce = -(np.log(yhat) * y + np.log(1 - yhat) * (1 - y)).sum()
    ce_i = ((yhat - y) * x).sum()
    ce_ii = (yhat * (1 - yhat) * x**2).sum()
    print(f'w={w:.9f} CE(w)={ce:.9f}')
    w -= ce_i / ce_ii  # Newton step
w=0.000000000 CE(w)=4.158883083
w=0.233301976 CE(w)=3.151547142
w=0.328737831 CE(w)=3.065299756
w=0.358540912 CE(w)=3.060405283
w=0.360841429 CE(w)=3.060380944
w=0.360853783 CE(w)=3.060380943
w=0.360853783 CE(w)=3.060380943
w=0.360853783 CE(w)=3.060380943
w=0.360853783 CE(w)=3.060380943
w=0.360853783 CE(w)=3.060380943
# Display the probability of passing the exam (according to the model)
# as a function of the study hours!
import matplotlib.pyplot as plt
x2 = np.arange(0, 24, 0.5)
yhat2 = sigmoid((x2 - xm) * w)
plt.figure(figsize=(16, 6))
plt.plot(x2, yhat2)
plt.grid(True)
plt.xlabel('study hours')
plt.ylabel('P(passing the exam)')
Text(0, 0.5, 'P(passing the exam)')
The Wisconsin Breast Cancer data set contains the attributes of 699 suspicious lesions in tissue microscopy images. The raw data is contained in wisconsin_data.txt, the description can be read in wisconsin_names.txt. The task is to estimate whether a lesion is malignant (4) or benign (2), based on the image attributes of the lesion. Therefore the task is a binary classification problem.
Exercise 2: Train a univariate logistic regression model for each input feature separately, and measure the average cross-entropy of the models! Use the full data set both for training and evaluation!
# Column names.
names = [
    'Sample_code_number',
    'Clump_Thickness',
    'Uniformity_of_Cell_Size',
    'Uniformity_of_Cell_Shape',
    'Marginal_Adhesion',
    'Single_Epithelial_Cell_Size',
    'Bare_Nuclei',
    'Bland_Chromatin',
    'Normal_Nucleoli',
    'Mitoses',
    'Class'
]
df = pd.read_csv('../_data/wisconsin_data.txt', sep=',', names=names, na_values='?')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   Sample_code_number           699 non-null    int64
 1   Clump_Thickness              699 non-null    int64
 2   Uniformity_of_Cell_Size      699 non-null    int64
 3   Uniformity_of_Cell_Shape     699 non-null    int64
 4   Marginal_Adhesion            699 non-null    int64
 5   Single_Epithelial_Cell_Size  699 non-null    int64
 6   Bare_Nuclei                  683 non-null    float64
 7   Bland_Chromatin              699 non-null    int64
 8   Normal_Nucleoli              699 non-null    int64
 9   Mitoses                      699 non-null    int64
 10  Class                        699 non-null    int64
dtypes: float64(1), int64(10)
memory usage: 60.2 KB
# substitute mean for the missing Bare_Nuclei values
df['Bare_Nuclei'].fillna(df['Bare_Nuclei'].mean(), inplace=True)
# target vector
y = df['Class'].values // 2 - 1
def logreg_predict(x, w):
    return sigmoid(x * w)

def logreg_fit(x, y, niter=10):
    w = 0  # initial model parameter
    for it in range(niter):
        yhat = logreg_predict(x, w)
        ce_i = ((yhat - y) * x).sum()
        ce_ii = (yhat * (1 - yhat) * x**2).sum()
        w -= ce_i / ce_ii  # Newton step
    return w

def avg_cross_entropy(y, yhat):
    ce = -(np.log(yhat) * y + np.log(1 - yhat) * (1 - y)).sum()
    return ce / len(y)
for column in names[1:-1]:
    se = df[column]
    x = (se - se.mean()).values  # input vector
    w = logreg_fit(x, y)
    yhat = logreg_predict(x, w)
    ace = avg_cross_entropy(y, yhat)
    print(f'{column:30} AvgCE={ace:.9f} w={w:.9f}')
Clump_Thickness                AvgCE=0.390611001 w=0.905912818
Uniformity_of_Cell_Size        AvgCE=0.199170002 w=1.572130112
Uniformity_of_Cell_Shape       AvgCE=0.211668062 w=1.524221657
Marginal_Adhesion              AvgCE=0.359316116 w=1.107418964
Single_Epithelial_Cell_Size    AvgCE=0.351132841 w=1.537879946
Bare_Nuclei                    AvgCE=0.265288430 w=0.979151501
Bland_Chromatin                AvgCE=0.303804235 w=1.542319399
Normal_Nucleoli                AvgCE=0.355869859 w=1.004452002
Mitoses                        AvgCE=0.527654735 w=1.711238337
Exercise 4: Repeat the previous experiment using scikit-learn!
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss # average cross entropy
for column in names[1:-1]:
    se = df[column]
    x = (se - se.mean()).values  # input vector
    cl = LogisticRegression(fit_intercept=False, C=1e12)  # define model
    X = x.reshape((-1, 1))
    cl.fit(X, y)  # train model
    w = cl.coef_[0, 0]  # extract model parameter
    yhat = cl.predict_proba(X)[:, 1]  # make probability prediction
    ace = log_loss(y, yhat)  # compute average cross entropy
    print(f'{column:30} AvgCE={ace:.9f} w={w:.9f}')
Clump_Thickness                AvgCE=0.390611001 w=0.905912814
Uniformity_of_Cell_Size        AvgCE=0.199170002 w=1.572130109
Uniformity_of_Cell_Shape       AvgCE=0.211668062 w=1.524221656
Marginal_Adhesion              AvgCE=0.359316116 w=1.107418964
Single_Epithelial_Cell_Size    AvgCE=0.351132841 w=1.537879938
Bare_Nuclei                    AvgCE=0.265288430 w=0.979151504
Bland_Chromatin                AvgCE=0.303804235 w=1.542318627
Normal_Nucleoli                AvgCE=0.355869859 w=1.004451975
Mitoses                        AvgCE=0.527654735 w=1.711237945
Exercise 5: Introduce a 70-30% train-test split and re-run the experiment!
# train-test split
from sklearn.model_selection import ShuffleSplit
tr, te = next(ShuffleSplit(test_size=0.3, random_state=42).split(df))
for column in names[1:-1]:
    se = df[column]
    x = (se - se.mean()).values  # input vector
    cl = LogisticRegression(fit_intercept=False, C=1e12)  # define model
    X = x.reshape((-1, 1))
    cl.fit(X[tr], y[tr])  # train model (on training set)
    w = cl.coef_[0, 0]  # extract model parameter
    yhat = cl.predict_proba(X)[:, 1]  # make probability prediction
    ace = log_loss(y[te], yhat[te])  # compute average cross entropy (on test set)
    print(f'{column:30} AvgCE={ace:.9f} w={w:.9f}')
Clump_Thickness                AvgCE=0.335747286 w=0.826808279
Uniformity_of_Cell_Size        AvgCE=0.176748276 w=1.515755629
Uniformity_of_Cell_Shape       AvgCE=0.217456999 w=1.535220556
Marginal_Adhesion              AvgCE=0.300066855 w=1.024555296
Single_Epithelial_Cell_Size    AvgCE=0.350178198 w=1.527215223
Bare_Nuclei                    AvgCE=0.311173420 w=1.008497031
Bland_Chromatin                AvgCE=0.306641746 w=1.453926694
Normal_Nucleoli                AvgCE=0.362623274 w=1.024082073
Mitoses                        AvgCE=0.491826787 w=1.562289264
The previous approach can be generalized to allow multiple input features.
Similarly to linear regression, the bias term can be handled by introducing a constant 1 feature.
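The constant-1 trick can be sketched on a small hypothetical input matrix (the values and names here are illustrative only):

```python
import numpy as np

X_toy = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])  # toy input matrix: 3 samples, 2 features

# Append a constant-1 column; the weight learned for it acts as the bias term.
X_b = np.hstack([X_toy, np.ones((X_toy.shape[0], 1))])
print(X_b.shape)  # one extra column
```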
Exercise 6: Train a multivariate logistic regression model on the training set and measure its cross-entropy on the test set! Implement the training algorithm without using scikit-learn!
X = df[names[1:-1]].values # input matrix
y = df['Class'].values // 2 - 1 # target vector
def cross_entropy(y, yhat):
    return -np.log(yhat) @ y - np.log(1 - yhat) @ (1 - y)

def logreg_m_fit(X, y, niter=10):
    w = np.zeros(X.shape[1])  # initial model
    for it in range(niter):
        yhat = sigmoid(X @ w)  # prediction
        ce = cross_entropy(y, yhat)  # cross-entropy
        g = X.T @ (yhat - y)  # gradient of cross-entropy
        H = X.T * (yhat * (1 - yhat)) @ X  # Hessian matrix of cross-entropy
        print(it, ce)
        w -= np.linalg.solve(H, g)  # Newton step
    return w
w = logreg_m_fit(X[tr], y[tr])
yhat = sigmoid(X @ w)
cross_entropy(y[te], yhat[te])
0 338.9489712938133
1 219.31492649580565
2 194.84385525659366
3 189.2334455179607
4 188.79436565306744
5 188.79058752143467
6 188.7905871820355
7 188.79058718203555
8 188.79058718203555
9 188.79058718203552
92.3437344981877
w
array([-0.37775352, 0.81513883, 0.24794682, -0.00703306, -0.7267206 ,
0.585522 , -0.43993472, 0.32078896, -0.23507939])
Exercise 7: Use scikit-learn for training the model and compare the results against the previous experiment's results!
from sklearn.linear_model import LogisticRegression
cl = LogisticRegression(fit_intercept=False, C=1e12) # no bias, no regularization
cl.fit(X[tr], y[tr])
yhat = cl.predict_proba(X)[:, 1]
ce = cross_entropy(y[te], yhat[te])
print(ce, ce / len(te))
28.330792605335226 0.4105911971787714
cl.coef_[0]
array([-0.37775325, 0.81513397, 0.24796104, -0.00703901, -0.72671579,
0.58551021, -0.43993414, 0.32079127, -0.23508148])
Exercise 8: Replace the train-test split with 10-fold cross-validation and re-run the last experiment!
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss
cv = KFold(10, shuffle=True, random_state=42)
scores = []
for tr, te in cv.split(X):
    cl = LogisticRegression(fit_intercept=False, C=1e12)  # no bias, no regularization
    cl.fit(X[tr], y[tr])
    yhat = cl.predict_proba(X)[:, 1]
    score = log_loss(y[te], yhat[te])
    scores.append(score)
scores
[0.3103854498633047, 0.5013502260500776, 0.492728969735146, 0.34584544309166304, 0.5820155518006203, 0.3244320020003381, 0.4848963100207666, 0.3690964445150554, 0.38712987311278146, 0.41059119717877135]
np.mean(scores)
0.4208471467368525
Regularization techniques in machine learning are used to prevent or reduce overfitting of a model to the training data. The fundamental idea is to simplify the model, making it less accurate on the training set but more accurate on unseen data. Essentially, the goal of regularization is to strike a good balance between model complexity and generalization performance.
Some specific regularization techniques are L1 regularization, L2 regularization, noise injection, data augmentation, early stopping, dropout and ensembling. In this lecture we will discuss L2 regularization, which can be seen as a simple and well-understood form of regularization.
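In scikit-learn's LogisticRegression, the parameter C is the inverse of the L2 regularization strength, so a smaller C means stronger regularization. A small sketch of the effect on synthetic data (the dataset and settings are assumptions for illustration, not from the lecture):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_syn, y_syn = make_classification(n_samples=200, n_features=20, random_state=42)

# Stronger L2 regularization (smaller C) shrinks the weight vector.
norms = []
for C in [1e6, 1.0, 1e-2]:
    cl = LogisticRegression(C=C, max_iter=1000).fit(X_syn, y_syn)
    norms.append(np.linalg.norm(cl.coef_))
print(norms)  # norms decrease as C decreases
```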
The file smsspam.txt contains labeled SMS messages.
Label "ham" indicates normal messages, label "spam" indicates undesired messages.
The goal is to build a classifier that estimates the label based on the text content of the SMS.
The error metric should be the average cross-entropy (aka log_loss).
# Load the raw data into a DataFrame!
import pandas as pd
df = pd.read_csv('../_data/smsspam.txt', sep='\t', names=['label', 'message'])
df
| label | message | |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |
| 5571 | ham | Rofl. Its true to its name |
5572 rows × 2 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   label    5572 non-null   object
 1   message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB
# Distribution of labels.
df.groupby('label').size()
label
ham     4825
spam     747
dtype: int64
Exercise 1: Compute the bag-of-words representation of the following example documents!
# Example documents.
documents = [
    'John likes to watch movies. Mary likes movies too.',
    'Mary also likes to watch football games.'
]
# Set of all words.
import string
def tokenize(doc):
    return [w.strip(string.punctuation) for w in doc.lower().split()]

words = set()
for doc in documents:
    for w in tokenize(doc):
        words.add(w)
words
{'also',
'football',
'games',
'john',
'likes',
'mary',
'movies',
'to',
'too',
'watch'}
# Create dictionary of word indices.
word_to_idx = {w: i for i, w in enumerate(sorted(words))}
word_to_idx
{'also': 0,
'football': 1,
'games': 2,
'john': 3,
'likes': 4,
'mary': 5,
'movies': 6,
'to': 7,
'too': 8,
'watch': 9}
# Build input matrix.
import numpy as np
X = np.zeros((len(documents), len(words)), dtype='int64')
for i, doc in enumerate(documents):
    for w in tokenize(doc):
        j = word_to_idx[w]
        X[i, j] = 1
X
array([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 0, 1, 1, 0, 1, 0, 1]])
# Shorter solution with sklearn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer
X_sp = CountVectorizer(binary=True).fit_transform(documents) # fit(), then transform()
X_sp
<2x10 sparse matrix of type '<class 'numpy.int64'>' with 14 stored elements in Compressed Sparse Row format>
X_sp.toarray()
array([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 0, 1, 1, 0, 1, 0, 1]])
Some standard storage formats:
- COO (coordinate format): stores the matrix as (row, column, value) triplets.
- CSR (compressed sparse row): stores the nonzeros row by row, with a compressed pointer array marking where each row begins.
- CSC (compressed sparse column): like CSR, but column by column.

Typically, the COO format is the most convenient for creating/building matrices. However, in computations, the CSR or CSC format can be more efficient.
# Create an example COO matrix!
import scipy.sparse as sp
data = [1, 1, 1, 1, 1]
i = [0, 0, 1, 2, 2] # row indices
j = [0, 3, 1, 4, 5] # column indices
A = sp.coo_matrix((data, (i, j)))
A
<3x6 sparse matrix of type '<class 'numpy.int64'>' with 5 stored elements in COOrdinate format>
# Number of nonzeros.
A.nnz
5
# Internal representation (.row, .col, .data).
A.row
array([0, 0, 1, 2, 2], dtype=int32)
A.col
array([0, 3, 1, 4, 5], dtype=int32)
A.data
array([1, 1, 1, 1, 1])
# Conversion to NumPy array.
A.toarray()
array([[1, 0, 0, 1, 0, 0],
[0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 1]])
# Conversion to CSR format.
B = A.tocsr()
B
<3x6 sparse matrix of type '<class 'numpy.int64'>' with 5 stored elements in Compressed Sparse Row format>
# Internal representation (.indices, .indptr, .data)
B.indices
array([0, 3, 1, 4, 5], dtype=int32)
B.indptr
array([0, 2, 3, 5], dtype=int32)
B.data
array([1, 1, 1, 1, 1])
# Selecting rows.
B[0]
<1x6 sparse matrix of type '<class 'numpy.int64'>' with 2 stored elements in Compressed Sparse Row format>
# Try to select rows from a COO matrix.
A[0]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [21], in <cell line: 2>()
      1 # Try to select rows from a COO matrix.
----> 2 A[0]

TypeError: 'coo_matrix' object is not subscriptable
# Conversion to CSC format.
C = A.tocsc()
C[:, 0]
<3x1 sparse matrix of type '<class 'numpy.int64'>' with 1 stored elements in Compressed Sparse Column format>
# What happens if we transpose a CSR matrix?
B.T
<6x3 sparse matrix of type '<class 'numpy.int64'>' with 5 stored elements in Compressed Sparse Column format>
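The transpose is essentially free: scipy reinterprets the CSR index buffers of $B$ as the CSC representation of $B^\top$, so by default no data is copied. A minimal check:

```python
import numpy as np
import scipy.sparse as sp

A = sp.coo_matrix(([1, 1, 1, 1, 1], ([0, 0, 1, 2, 2], [0, 3, 1, 4, 5])))
B = A.tocsr()

# Transposing a CSR matrix yields a CSC matrix with the transposed contents.
print(B.T.format)  # csc
print(np.array_equal(B.T.toarray(), A.toarray().T))  # True
```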
Exercise 2: Prepare the input matrix $X$ (sparse) and the target vector $y$ (dense)!
X = CountVectorizer(binary=True).fit_transform(df['message'])
X
<5572x8713 sparse matrix of type '<class 'numpy.int64'>' with 74169 stored elements in Compressed Sparse Row format>
y = (df['label'] == 'spam').values.astype('int64')
y
array([0, 0, 1, ..., 0, 0, 0])
Exercise 3: Measure the training and test log_loss of logistic regression, using 5-fold cross-validation! (Scikit-learn's LogisticRegression and LinearRegression accept sparse input matrices.)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold
cv = KFold(5, shuffle=True, random_state=42)
scores_tr = []
scores_te = []
for tr, te in cv.split(X):
cl = LogisticRegression()
cl.fit(X[tr], y[tr])
yhat = cl.predict_proba(X)[:, 1]
scores_tr.append(log_loss(y[tr], yhat[tr]))
scores_te.append(log_loss(y[te], yhat[te]))
scores_tr, scores_te
([0.020030998513910356, 0.02026798586333316, 0.018731273244436235, 0.019306333644836884, 0.01860439768051876], [0.04673015055948079, 0.042830755193124764, 0.060999083223150466, 0.06704297424445832, 0.06559486598950316])
# average training loss
np.mean(scores_tr)
0.01938819778940708
# average test loss
np.mean(scores_te)
0.0566395658419435
Exercise 4: Plot the training and test error of logistic regression as a function of the regularization coefficient $\lambda$! The relation between $\lambda$ and scikit-learn's $C$ parameter is $C = 1 / \lambda$.
# Model evaluation function.
def evaluate(cl, X, y):
cv = KFold(5, shuffle=True, random_state=42)
scores_tr = []
scores_te = []
for tr, te in cv.split(X):
cl.fit(X[tr], y[tr])
yhat = cl.predict_proba(X)[:, 1]
scores_tr.append(log_loss(y[tr], yhat[tr]))
scores_te.append(log_loss(y[te], yhat[te]))
return {
'score_tr': np.mean(scores_tr),
'score_te': np.mean(scores_te)
}
# Trying different lambda values.
data = []
for lambd in [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100]:
print(lambd)
cl = LogisticRegression(C=1 / lambd)
res = evaluate(cl, X, y)
res['lambda'] = lambd
data.append(res)
1e-06 1e-05 0.0001 0.001 0.01 0.1 1 10 100
df_res = pd.DataFrame(data).set_index('lambda')
df_res.plot(figsize=(12, 6), grid=True, logx=True)
<AxesSubplot: xlabel='lambda'>
# optimal lambda value
df_res['score_te'].idxmin()
0.1
Exercise 5: Train ridge regression models on the Boston Housing data set that predict the MEDV column. The error metric should be RMSE. Plot the training and test error of ridge regression as a function of the regularization coefficient $\alpha$! Apply 7-fold cross-validation!
# Load the Boston Housing data set.
columns = []
for l in open('../_data/housing_names.txt'):
if l[:4] == ' ' and l[4].isdigit():
columns.append(l.split()[1])
df = pd.read_csv('../_data/housing_data.txt', sep='\t', names=columns)
X = df[df.columns[:-1]].values # input matrix
y = df['MEDV'].values # target vector
X.shape
(506, 12)
y.shape
(506,)
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
# Model evaluation function.
def evaluate(re, X, y):
cv = KFold(7, shuffle=True, random_state=42)
scores_tr = []
scores_te = []
for tr, te in cv.split(X):
re.fit(X[tr], y[tr])
yhat = re.predict(X)
scores_tr.append(mean_squared_error(y[tr], yhat[tr])**0.5)
scores_te.append(mean_squared_error(y[te], yhat[te])**0.5)
return {
'score_tr': np.mean(scores_tr),
'score_te': np.mean(scores_te)
}
evaluate(Ridge(), X, y)
{'score_tr': 4.741699999401361, 'score_te': 4.909347897956921}
# Trying different alpha values.
data = []
for alpha in [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100]:
print(alpha)
re = Ridge(alpha=alpha)
res = evaluate(re, X, y)
res['alpha'] = alpha
data.append(res)
1e-06 1e-05 0.0001 0.001 0.01 0.1 1 10 100
df_res = pd.DataFrame(data).set_index('alpha')
df_res.plot(figsize=(12, 6), grid=True, logx=True)
<AxesSubplot: xlabel='alpha'>
The decision tree is an old but still relevant nonlinear learning algorithm. The leaves of the tree represent distinct subsets of the training data. The internal nodes compare a given attribute to a threshold value (e.g. is the body temperature > 37 °C?). The branches starting from a node correspond to the two possible outcomes of the comparison.
In this notebook, we will implement the simplest version of the decision tree, called a decision stump, and test it on the Boston Housing data set. Moreover, we will explore the capabilities of scikit-learn's decision tree algorithm.
Exercise 2: Implement the training of the decision stump regressor from scratch, and measure the model's root mean squared error (RMSE) on the full Boston Housing data set!
# Loading the data.
import pandas as pd
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT', 'MEDV']
df = pd.read_csv('../_data/housing_data.txt', delim_whitespace=True, names=names)
X = df[df.columns[:-1]].values # input matrix
y = df['MEDV'].values # target vector
X.shape
(506, 12)
y.shape
(506,)
# Training algorithm.
import numpy as np
n, d = X.shape
sse_min = np.inf
for j in range(d): # iterate over all features
x = X[:, j] # select j-th column
# sort x and y by x
idxs = x.argsort()
x_sorted = x[idxs]
y_sorted = y[idxs]
# find optimal threshold value
for i in range(n - 1):
t = (x_sorted[i] + x_sorted[i + 1]) / 2
y_L = y_sorted[:(i + 1)]
y_R = y_sorted[(i + 1):]
yhat_L = y_L.mean() # prediction for left branch
yhat_R = y_R.mean() # prediction for right branch
sse_L = ((y_L - yhat_L)**2).sum()
sse_R = ((y_R - yhat_R)**2).sum()
sse = sse_L + sse_R # sum of squared errors
# (instead of re-computing SSE, updating it would be more efficient!)
if sse < sse_min:
sse_min = sse
t_opt = t
j_opt = j
print(j, t, sse)
0 0.007690000000000001 42714.138495049505
0 0.01001 42654.06214285715
0 0.01306 42607.60066733068
...
5 6.918 23443.898181818182
5 6.941 23376.74038861689
(long improvement log truncated)
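The search loop above only finds the best split; the exercise also asks for the model's RMSE. A self-contained sketch of the full stump (fit, predict, RMSE) on synthetic data — the function and variable names are made up for illustration:

```python
import numpy as np

def fit_stump(X, y):
    """Exhaustive search for the (feature, threshold) pair minimizing SSE."""
    n, d = X.shape
    best = (np.inf, None, None, None, None)
    for j in range(d):
        order = X[:, j].argsort()
        xs, ys = X[order, j], y[order]
        for i in range(n - 1):
            t = (xs[i] + xs[i + 1]) / 2
            y_L, y_R = ys[:i + 1], ys[i + 1:]
            sse = ((y_L - y_L.mean())**2).sum() + ((y_R - y_R.mean())**2).sum()
            if sse < best[0]:
                best = (sse, j, t, y_L.mean(), y_R.mean())
    return best[1:]  # (feature, threshold, left prediction, right prediction)

def predict_stump(stump, X):
    j, t, yhat_L, yhat_R = stump
    return np.where(X[:, j] <= t, yhat_L, yhat_R)

# Synthetic data: the target is a step function of the first feature,
# so the stump should recover feature 0 and fit it perfectly.
rng = np.random.RandomState(0)
X_toy = rng.rand(100, 3)
y_toy = np.where(X_toy[:, 0] <= 0.5, 10.0, 20.0)

stump = fit_stump(X_toy, y_toy)
rmse = ((y_toy - predict_stump(stump, X_toy))**2).mean()**0.5
print(stump[0], rmse)  # 0 0.0
```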
Exercise 3: Repeat the previous experiment using scikit-learn!
from sklearn.tree import DecisionTreeRegressor
cl = DecisionTreeRegressor(max_depth=1)
cl.fit(X, y)
DecisionTreeRegressor(max_depth=1)
# Internal parameters of the trained model (.tree_.{feature, threshold, value})
print(cl.tree_.feature)
print(cl.tree_.threshold)
print(cl.tree_.value)
[ 5 -2 -2]
[ 6.94099998 -2.         -2.        ]
[[[22.53280632]]
 [[19.93372093]]
 [[37.23815789]]]
Exercise 4: Apply 3-fold cross-validation!
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
def evaluate(re, X, y):
cv = KFold(3, random_state=42, shuffle=True)
scores_tr = []
scores_te = []
for tr, te in cv.split(X):
re.fit(X[tr], y[tr])
yhat = re.predict(X)
scores_tr.append(mean_squared_error(y[tr], yhat[tr])**0.5)
scores_te.append(mean_squared_error(y[te], yhat[te])**0.5)
return np.mean(scores_tr), np.mean(scores_te)
evaluate(DecisionTreeRegressor(max_depth=1), X, y)
(6.748971341330223, 7.435013307929704)
Exercise 5: Determine what maximal depth gives the lowest RMSE!
res = []
for max_depth in range(1, 30):
rmse_tr, rmse_te = evaluate(DecisionTreeRegressor(max_depth=max_depth), X, y)
res.append({
'max_depth': max_depth,
'rmse_tr': rmse_tr,
'rmse_te': rmse_te
})
df_res = pd.DataFrame(res).set_index('max_depth')
df_res.plot(figsize=(12, 6), grid=True)
<AxesSubplot: xlabel='max_depth'>
# Optimal max_depth.
df_res['rmse_te'].idxmin()
5
df_res.loc[df_res['rmse_te'].idxmin()]
rmse_tr    2.545337
rmse_te    4.463304
Name: 5, dtype: float64
# Comparison against ridge regression.
from sklearn.linear_model import Ridge
evaluate(Ridge(), X, y)
(4.736093991350386, 4.8792312907094795)
Exercise 6: Train a decision tree of depth 3 and visualize the trained model!
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
re = DecisionTreeRegressor(max_depth=3)
re.fit(X, y)
plt.figure(figsize=(14, 6))
plot_tree(re, rounded=True, filled=True, feature_names=names)
pass
# Number of parameters in the tree.
(re.tree_.feature >= 0).sum() * 2
14
# Total size of the data set.
X.shape[0] * (X.shape[1] + 1)
6578
Exercise 7: Apply a decision tree classifier for the Wisconsin Breast Cancer data set! Use 5-fold cross-validation! The evaluation metric should be the ratio of correct classifications. Determine the maximal depth that gives the highest accuracy! Compare the best decision tree against logistic regression!
# Load the data to DataFrame.
import pandas as pd
names = [
'Sample_code_number', 'Clump_Thickness', 'Uniformity_of_Cell_Size',
'Uniformity_of_Cell_Shape', 'Marginal_Adhesion', 'Single_Epithelial_Cell_Size',
'Bare_Nuclei', 'Bland_Chromatin', 'Normal_Nucleoli', 'Mitoses', 'Class'
]
df = pd.read_csv('../_data/wisconsin_data.txt', sep=',', names=names, na_values='?')
df['Bare_Nuclei'].fillna(0, inplace=True)
df.head()
| | Sample_code_number | Clump_Thickness | Uniformity_of_Cell_Size | Uniformity_of_Cell_Shape | Marginal_Adhesion | Single_Epithelial_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10.0 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2.0 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4.0 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1.0 | 3 | 1 | 1 | 2 |
# Input matrix.
X = df[names[1:-1]].values
X.shape
(699, 9)
# Target vector.
y = df['Class'].values // 2 - 1
# Model evaluation function.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
def evaluate(cl, X, y):
cv = KFold(5, random_state=42, shuffle=True)
scores_tr = []
scores_te = []
for tr, te in cv.split(X):
cl.fit(X[tr], y[tr])
yhat = cl.predict(X)
scores_tr.append(accuracy_score(y[tr], yhat[tr]))
scores_te.append(accuracy_score(y[te], yhat[te]))
return np.mean(scores_tr), np.mean(scores_te)
# Trying different max_depth values.
res = []
for max_depth in range(1, 30):
acc_tr, acc_te = evaluate(DecisionTreeClassifier(max_depth=max_depth), X, y)
res.append({
'max_depth': max_depth,
'acc_tr': acc_tr,
'acc_te': acc_te
})
df_res = pd.DataFrame(res).set_index('max_depth')
df_res.plot(figsize=(12, 6), grid=True)
<AxesSubplot: xlabel='max_depth'>
# Optimal max_depth value.
df_res['acc_te'].idxmax()
8
df_res.loc[8]
acc_tr    0.996783
acc_te    0.947061
Name: 8, dtype: float64
# Comparison to logistic regression.
from sklearn.linear_model import LogisticRegression
evaluate(LogisticRegression(), X, y)
(0.9703136979299771, 0.9627954779033916)
Decision trees
Random forest is a tree-based ensemble algorithm for classification and regression. It can be viewed as a "parallel circuit" of decision trees: the output of the forest is the average output of its trees. Some methods to grow an accurate and diverse set of trees are as follows:
- bootstrapping: each tree is trained on n rows sampled with replacement from the training set;
- feature bagging: each tree sees only a random subset of the features;
- extreme randomization: split thresholds are chosen at random instead of optimized.
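Bootstrapping draws $n$ rows with replacement, so each tree is trained on only about $1 - 1/e \approx 63\%$ of the distinct rows; the remaining rows are "out-of-bag". A quick numerical check:

```python
import numpy as np

rs = np.random.RandomState(42)
n = 100_000
idxs = rs.randint(0, n, n)          # bootstrap sample: n draws with replacement
frac_in = len(np.unique(idxs)) / n  # fraction of rows drawn at least once

print(round(frac_in, 3))  # ≈ 0.632
```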
Exercise 1: Implement a bootstrapping based random forest regressor and evaluate it on the Boston Housing data set using 3-fold cross-validation! The evaluation metric should be RMSE!
# Load the data to DataFrame.
import pandas as pd
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT', 'MEDV']
df = pd.read_csv('../_data/housing_data.txt', delim_whitespace=True, names=names)
df = df.sample(len(df), random_state=42) # shuffle the data
X = df[df.columns[:-1]].values # input matrix
y = df['MEDV'].values # target vector
import numpy as np
from sklearn.tree import DecisionTreeRegressor
class SimpleRandomForestRegressor:
def __init__(self, n_trees=100, max_depth=None):
self.n_trees = n_trees
self.max_depth = max_depth
def fit(self, X, y):
rs = np.random.RandomState(42)
self.trees = []
for k in range(self.n_trees):
idxs = rs.randint(0, len(y), len(y)) # bootstrapping
re = DecisionTreeRegressor(max_depth=self.max_depth, random_state=k) # create decision tree
re.fit(X[idxs], y[idxs]) # train decision tree
self.trees.append(re)
def predict(self, X):
yhat = np.zeros(len(X))
for re in self.trees:
yhat += re.predict(X)
return yhat / len(self.trees)
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
def evaluate(re, X, y):
cv = KFold(3, random_state=42, shuffle=True)
scores_tr = []
scores_te = []
for tr, te in cv.split(X):
re.fit(X[tr], y[tr])
yhat = re.predict(X)
scores_tr.append(mean_squared_error(y[tr], yhat[tr])**0.5)
scores_te.append(mean_squared_error(y[te], yhat[te])**0.5)
return np.mean(scores_tr), np.mean(scores_te)
evaluate(SimpleRandomForestRegressor(), X, y)
(1.2496535188109161, 3.5344184688013613)
evaluate(DecisionTreeRegressor(random_state=42), X, y)
(0.0, 4.867211481330203)
from sklearn.linear_model import Ridge
evaluate(Ridge(), X, y)
(4.710820669783009, 4.927289718141437)
# How the test RMSE depends on the number of trees.
res = []
for n_trees in range(1, 101):
print(n_trees)
re = SimpleRandomForestRegressor(n_trees=n_trees)
_, rmse_te = evaluate(re, X, y)
res.append({'n_trees': n_trees, 'rmse_te': rmse_te})
1 2 3 ... 100 (progress output truncated)
df_res = pd.DataFrame(res).set_index('n_trees')
df_res.plot()
<AxesSubplot: xlabel='n_trees'>
Exercise 2: Implement a feature bagging based random forest regressor and evaluate it on the Boston Housing data set using 3-fold cross-validation!
Our feature bagging strategy: We will bootstrap the columns of X (we will sample d columns with replacement).
class SimpleRandomForestRegressor:
def __init__(self, n_trees=100, max_depth=None, bootstrap=True, feat_bag=False):
self.n_trees = n_trees
self.max_depth = max_depth
self.bootstrap = bootstrap
self.feat_bag = feat_bag
def fit(self, X, y):
rs = np.random.RandomState(42)
n, d = X.shape
self.trees = []
for k in range(self.n_trees):
row_idxs = np.arange(n)
if self.bootstrap:
row_idxs = rs.randint(0, n, n) # bootstrapping
col_idxs = np.arange(d)
if self.feat_bag:
col_idxs = rs.randint(0, d, d) # feature bagging
tree = DecisionTreeRegressor(max_depth=self.max_depth, random_state=k) # create decision tree
tree.fit(X[row_idxs][:, col_idxs], y[row_idxs]) # train decision tree
self.trees.append((tree, col_idxs))
def predict(self, X):
yhat = np.zeros(len(X))
for tree, col_idxs in self.trees:
yhat += tree.predict(X[:, col_idxs])
return yhat / len(self.trees)
evaluate(SimpleRandomForestRegressor(bootstrap=False, feat_bag=False), X, y)
(3.221869811461758e-14, 4.563463998771837)
evaluate(SimpleRandomForestRegressor(bootstrap=True, feat_bag=False), X, y)
(1.2496535188109161, 3.5344184688013613)
evaluate(SimpleRandomForestRegressor(bootstrap=False, feat_bag=True), X, y)
(0.018271831222768983, 3.746377378818943)
evaluate(SimpleRandomForestRegressor(bootstrap=True, feat_bag=True), X, y)
(1.3106116530969623, 3.65460581683147)
Exercise 3: Repeat the previous experiments using scikit-learn's RandomForestRegressor class!
from sklearn.ensemble import RandomForestRegressor
# no bootstrapping, no feature bagging
evaluate(RandomForestRegressor(bootstrap=False, max_features=1.0, random_state=42), X, y)
(3.221869811461758e-14, 4.534326812091633)
# bootstrapping, no feature bagging
evaluate(RandomForestRegressor(bootstrap=True, max_features=1.0, random_state=42), X, y)
(1.249721947092092, 3.578281219483339)
# no bootstrapping, feature bagging (using the 50% strategy)
evaluate(RandomForestRegressor(bootstrap=False, max_features=0.5, random_state=42), X, y)
(0.0008610326361854645, 3.432009311114937)
# bootstrapping and feature bagging (using the 50% strategy)
evaluate(RandomForestRegressor(bootstrap=True, max_features=0.5, random_state=42), X, y)
(1.2368681562001127, 3.464869293320867)
Exercise 4: Plot the training and the test RMSE as a function of the number of trees!
res = []
for n_trees in range(1, 51):
print(n_trees, end=' ')
# no bootstrapping, feature bagging
re = RandomForestRegressor(n_estimators=n_trees, bootstrap=False, max_features=0.5, random_state=42)
rmse_tr, rmse_te = evaluate(re, X, y)
res.append({'n_trees': n_trees, 'rmse_tr': rmse_tr, 'rmse_te': rmse_te})
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
df_res = pd.DataFrame(res).set_index('n_trees')
df_res.plot(figsize=(12, 6), grid=True)
<AxesSubplot: xlabel='n_trees'>
res = []
for n_trees in range(1, 51):
print(n_trees, end=' ')
# bootstrapping, no feature bagging
re = RandomForestRegressor(n_estimators=n_trees, bootstrap=True, max_features=1.0, random_state=42)
rmse_tr, rmse_te = evaluate(re, X, y)
res.append({'n_trees': n_trees, 'rmse_tr': rmse_tr, 'rmse_te': rmse_te})
df_res = pd.DataFrame(res).set_index('n_trees')
df_res.plot(figsize=(12, 6), grid=True)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
<AxesSubplot: xlabel='n_trees'>
Exercise 5: Repeat the previous experiment using the "extreme randomization" strategy!
from sklearn.ensemble import ExtraTreesRegressor
evaluate(ExtraTreesRegressor(random_state=42), X, y)
(3.221869811461758e-14, 3.5508313733547276)
res = []
for n_trees in range(1, 51):
print(n_trees, end=' ')
re = ExtraTreesRegressor(n_estimators=n_trees, random_state=42)
rmse_tr, rmse_te = evaluate(re, X, y)
res.append({'n_trees': n_trees, 'rmse_tr': rmse_tr, 'rmse_te': rmse_te})
df_res = pd.DataFrame(res).set_index('n_trees')
df_res.plot(figsize=(12, 6), grid=True)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
<AxesSubplot: xlabel='n_trees'>
from sklearn.model_selection import ShuffleSplit
tr, te = next(ShuffleSplit(test_size=0.3, random_state=42).split(X))
res = []
re = ExtraTreesRegressor(n_estimators=1, random_state=42)
for n_trees in range(1, 201):
print(n_trees, end=' ')
if n_trees > 1:
re.warm_start = True
re.n_estimators += 1
re.fit(X[tr], y[tr])
yhat = re.predict(X)
rmse_tr = mean_squared_error(y[tr], yhat[tr])**0.5
rmse_te = mean_squared_error(y[te], yhat[te])**0.5
res.append({'n_trees': n_trees, 'rmse_tr': rmse_tr, 'rmse_te': rmse_te})
df_res = pd.DataFrame(res).set_index('n_trees')
df_res.plot(figsize=(12, 6), grid=True)
1 2 3 ... 200 (progress output truncated)
<AxesSubplot: xlabel='n_trees'>
Exercise 1: Implement a tree based gradient boosting regressor and evaluate it on the Boston Housing data set using 3-fold cross-validation! Use a maximal tree depth of 3!
# Load the Boston Housing data set.
import pandas as pd
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT', 'MEDV']
df = pd.read_csv('../_data/housing_data.txt', delim_whitespace=True, names=names)
df = df.sample(len(df), random_state=42) # data shuffling
X = df.values[:, :-1] # input matrix
y = df['MEDV'].values # target vector
X.shape
(506, 12)
y.shape
(506,)
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
def evaluate(re, X, y):
cv = KFold(3, random_state=42, shuffle=True)
scores_tr = []
scores_te = []
for tr, te in cv.split(X):
re.fit(X[tr], y[tr])
yhat = re.predict(X)
scores_tr.append(mean_squared_error(y[tr], yhat[tr])**0.5)
scores_te.append(mean_squared_error(y[te], yhat[te])**0.5)
return np.mean(scores_tr), np.mean(scores_te)
import numpy as np
from sklearn.tree import DecisionTreeRegressor
class SimpleGradientBoostingRegressor:
def __init__(self, n_trees=100, eta=0.1, max_depth=3):
self.n_trees = n_trees
self.eta = eta
self.max_depth = max_depth
def fit(self, X, y):
self.F0 = np.mean(y) # best constant model
r = y - self.F0 # pseudo-residuals
self.trees = []
for m in range(self.n_trees):
tree = DecisionTreeRegressor(max_depth=self.max_depth, random_state=m)
tree.fit(X, r) # fit base learner
rhat = tree.predict(X) # base learner's prediction
gamma = (rhat @ r) / (rhat @ rhat) # optimal step size
w = self.eta * gamma
self.trees.append((w, tree)) # save tree and w
r -= w * rhat
def predict(self, X):
yhat = np.ones(len(X)) * self.F0
for w, tree in self.trees:
yhat += w * tree.predict(X)
return yhat
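The step size computed in `fit()` is an exact line search. Writing $r$ for the current pseudo-residuals and $\hat{r}$ for the base learner's prediction, $\gamma$ is chosen to minimize the remaining squared error:

$$\gamma^* = \arg\min_{\gamma} \|r - \gamma \hat{r}\|^2, \qquad \frac{d}{d\gamma}\|r - \gamma \hat{r}\|^2 = -2\,\hat{r}^\top (r - \gamma \hat{r}) = 0 \;\Rightarrow\; \gamma^* = \frac{\hat{r}^\top r}{\hat{r}^\top \hat{r}},$$

which is the `(rhat @ r) / (rhat @ rhat)` expression in the code; the learning rate $\eta$ then damps the step to $w = \eta\,\gamma^*$.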
evaluate(SimpleGradientBoostingRegressor(), X, y)
(1.2716614748909418, 3.363000134587033)
Exercise 2: Repeat the previous experiment using scikit-learn!
from sklearn.ensemble import GradientBoostingRegressor
evaluate(GradientBoostingRegressor(random_state=42), X, y)
(1.2716614748909418, 3.36820868174328)
Exercise 3: Which tree depth gives the most accurate ensemble?
res = []
for max_depth in range(1, 13):
print(max_depth, end=' ')
re = GradientBoostingRegressor(max_depth=max_depth, random_state=42)
rmse_tr, rmse_te = evaluate(re, X, y)
res.append({'max_depth': max_depth, 'rmse_tr': rmse_tr, 'rmse_te': rmse_te})
df_res = pd.DataFrame(res).set_index('max_depth')
df_res.plot(figsize=(12, 6), grid=True)
<AxesSubplot: xlabel='max_depth'>
Exercise 3/B: How do the training and test RMSE change with the number of trees? (Use a simple train-test split for this experiment!)
from sklearn.model_selection import ShuffleSplit
tr, te = next(ShuffleSplit(test_size=0.3, random_state=42).split(X))
res = []
re = GradientBoostingRegressor(n_estimators=1, random_state=42)
for n_trees in range(1, 401):
print(n_trees, end=' ')
if n_trees > 1:
re.warm_start = True
re.n_estimators += 1
re.fit(X[tr], y[tr])
yhat = re.predict(X)
rmse_tr = mean_squared_error(y[tr], yhat[tr])**0.5
rmse_te = mean_squared_error(y[te], yhat[te])**0.5
res.append({'n_trees': n_trees, 'rmse_tr': rmse_tr, 'rmse_te': rmse_te})
df_res = pd.DataFrame(res).set_index('n_trees')
df_res.plot(figsize=(12, 6), grid=True)
1 2 3 ... 400 (progress output truncated)
<AxesSubplot: xlabel='n_trees'>
# What happens if we use deeper trees?
from sklearn.model_selection import ShuffleSplit
tr, te = next(ShuffleSplit(test_size=0.3, random_state=42).split(X))
res = []
re = GradientBoostingRegressor(n_estimators=1, max_depth=6, random_state=42)
for n_trees in range(1, 401):
print(n_trees, end=' ')
if n_trees > 1:
re.warm_start = True
re.n_estimators += 1
re.fit(X[tr], y[tr])
yhat = re.predict(X)
rmse_tr = mean_squared_error(y[tr], yhat[tr])**0.5
rmse_te = mean_squared_error(y[te], yhat[te])**0.5
res.append({'n_trees': n_trees, 'rmse_tr': rmse_tr, 'rmse_te': rmse_te})
df_res = pd.DataFrame(res).set_index('n_trees')
df_res.plot(figsize=(12, 6), grid=True)
1 2 3 ... 400 (progress output truncated)
<AxesSubplot: xlabel='n_trees'>
Exercise 4: Apply a random forest and a gradient boosting classifier on the Wisconsin Breast Cancer data set! Use stratified 10-fold cross-validation! The evaluation metric should be the ratio of correct classifications. For both ensemble methods, determine the maximal tree depth that gives the highest accuracy!
# Load the Wisconsin Breast Cancer data set.
import pandas as pd
names = [
'Sample_code_number', 'Clump_Thickness', 'Uniformity_of_Cell_Size',
'Uniformity_of_Cell_Shape', 'Marginal_Adhesion', 'Single_Epithelial_Cell_Size',
'Bare_Nuclei', 'Bland_Chromatin', 'Normal_Nucleoli', 'Mitoses', 'Class'
]
df = pd.read_csv('../_data/wisconsin_data.txt', sep=',', names=names, na_values='?')
df = df.sample(len(df), random_state=42) # data shuffling
df['Bare_Nuclei'].fillna(0, inplace=True)
X = df[df.columns[1: -1]].values
y = (df['Class'].values / 2 - 1).astype('int')
X.shape, y.shape
((699, 9), (699,))
# Evaluation function.
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
def evaluate(cl, X, y):
cv = StratifiedKFold(10, shuffle=True, random_state=42)
scores = []
for tr, te in cv.split(X, y):
cl.fit(X[tr], y[tr])
yhat = cl.predict(X) # label prediction
score = accuracy_score(y[te], yhat[te])
scores.append(score)
return np.mean(scores)
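The manual cross-validation loop above can also be expressed with scikit-learn's `cross_val_score` helper using the same splitter. A minimal sketch on synthetic data (the data set and classifier below are illustrative, not the exercise's):

```python
# Sketch: the manual CV loop expressed via cross_val_score with the
# same StratifiedKFold splitter. Synthetic data for illustration only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

Xs, ys = make_classification(n_samples=500, random_state=42)
cv = StratifiedKFold(10, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), Xs, ys,
                         cv=cv, scoring='accuracy')
print(len(scores), round(np.mean(scores), 3))  # one score per fold
```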
# Dummy classifier's accuracy.
from sklearn.dummy import DummyClassifier
evaluate(DummyClassifier(), X, y)
0.6552173913043479
# Logistic regression's accuracy.
from sklearn.linear_model import LogisticRegression
evaluate(LogisticRegression(), X, y)
0.9656314699792962
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
evaluate(GradientBoostingClassifier(random_state=42), X, y)
0.9527536231884058
evaluate(RandomForestClassifier(random_state=42), X, y)
0.9698964803312631
# Sweep different max_depth values for gradient boosting and random forest.
res = []
for max_depth in range(1, 11):
print(max_depth, end=' ')
gb = GradientBoostingClassifier(max_depth=max_depth, random_state=42)
rf = RandomForestClassifier(max_depth=max_depth, random_state=42)
res.append({'max_depth': max_depth, 'RF_acc': evaluate(rf, X, y), 'GB_acc': evaluate(gb, X, y)})
1 2 3 4 5 6 7 8 9 10
df_res = pd.DataFrame(res).set_index('max_depth')
df_res.plot(figsize=(12, 6), grid=True)
<AxesSubplot: xlabel='max_depth'>
Exercise 5: Compare XGBoost, LightGBM and scikit-learn's GradientBoostingClassifier on the Wisconsin Breast Cancer problem, in terms of speed and accuracy!
import xgboost
xgboost.__version__
'1.7.3'
import lightgbm
lightgbm.__version__
'3.3.5'
import time
from xgboost import XGBClassifier
t0 = time.time()
print('acc:', evaluate(XGBClassifier(n_estimators=100, max_depth=3), X, y))
print('time:', time.time() - t0)
acc: 0.9585093167701864
time: 0.4498872756958008
from lightgbm import LGBMClassifier
t0 = time.time()
print('acc:', evaluate(LGBMClassifier(n_estimators=100, num_leaves=8), X, y))
print('time:', time.time() - t0)
acc: 0.9585300207039337
time: 0.28879475593566895
t0 = time.time()
print('acc:', evaluate(GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42), X, y))
print('time:', time.time() - t0)
acc: 0.9527536231884058
time: 1.1589224338531494
# Since version 0.21, scikit-learn includes a histogram based
# gradient boosting algorithm that was inspired by LightGBM.
from sklearn.ensemble import HistGradientBoostingClassifier
t0 = time.time()
print('acc:', evaluate(HistGradientBoostingClassifier(max_iter=100, max_leaf_nodes=8, random_state=42), X, y))
print('time:', time.time() - t0)
acc: 0.9585093167701864
time: 1.296454668045044
The Phishing Websites data set contains certain attributes of web sites. The target attribute is the last column. It specifies whether the site is legitimate (-1) or phishing (+1). Our goal will be to build an artificial neural network that predicts the value of the target attribute.
Exercise 1: Load the Phishing Websites data set to a data frame. Prepare the input matrix and the target vector.
# Load data.
import pandas as pd
from urllib.request import urlopen
url = 'https://archive.ics.uci.edu/'\
'ml/machine-learning-databases/00327/Training%20Dataset.arff'
lines = urlopen(url).read().decode('utf-8').split('\r\n')
names = [l.split()[1] for l in lines if l.startswith('@att')]
skiprows = lines.index('@data') + 1
df = pd.read_csv(url, names=names, skiprows=skiprows)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11055 entries, 0 to 11054
Data columns (total 31 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   having_IP_Address            11055 non-null  int64
 1   URL_Length                   11055 non-null  int64
 2   Shortining_Service           11055 non-null  int64
 3   having_At_Symbol             11055 non-null  int64
 4   double_slash_redirecting     11055 non-null  int64
 5   Prefix_Suffix                11055 non-null  int64
 6   having_Sub_Domain            11055 non-null  int64
 7   SSLfinal_State               11055 non-null  int64
 8   Domain_registeration_length  11055 non-null  int64
 9   Favicon                      11055 non-null  int64
 10  port                         11055 non-null  int64
 11  HTTPS_token                  11055 non-null  int64
 12  Request_URL                  11055 non-null  int64
 13  URL_of_Anchor                11055 non-null  int64
 14  Links_in_tags                11055 non-null  int64
 15  SFH                          11055 non-null  int64
 16  Submitting_to_email          11055 non-null  int64
 17  Abnormal_URL                 11055 non-null  int64
 18  Redirect                     11055 non-null  int64
 19  on_mouseover                 11055 non-null  int64
 20  RightClick                   11055 non-null  int64
 21  popUpWidnow                  11055 non-null  int64
 22  Iframe                       11055 non-null  int64
 23  age_of_domain                11055 non-null  int64
 24  DNSRecord                    11055 non-null  int64
 25  web_traffic                  11055 non-null  int64
 26  Page_Rank                    11055 non-null  int64
 27  Google_Index                 11055 non-null  int64
 28  Links_pointing_to_page       11055 non-null  int64
 29  Statistical_report           11055 non-null  int64
 30  Result                       11055 non-null  int64
dtypes: int64(31)
memory usage: 2.6 MB
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| having_IP_Address | 11055.0 | 0.313795 | 0.949534 | -1.0 | -1.0 | 1.0 | 1.0 | 1.0 |
| URL_Length | 11055.0 | -0.633198 | 0.766095 | -1.0 | -1.0 | -1.0 | -1.0 | 1.0 |
| Shortining_Service | 11055.0 | 0.738761 | 0.673998 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| having_At_Symbol | 11055.0 | 0.700588 | 0.713598 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| double_slash_redirecting | 11055.0 | 0.741474 | 0.671011 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Prefix_Suffix | 11055.0 | -0.734962 | 0.678139 | -1.0 | -1.0 | -1.0 | -1.0 | 1.0 |
| having_Sub_Domain | 11055.0 | 0.063953 | 0.817518 | -1.0 | -1.0 | 0.0 | 1.0 | 1.0 |
| SSLfinal_State | 11055.0 | 0.250927 | 0.911892 | -1.0 | -1.0 | 1.0 | 1.0 | 1.0 |
| Domain_registeration_length | 11055.0 | -0.336771 | 0.941629 | -1.0 | -1.0 | -1.0 | 1.0 | 1.0 |
| Favicon | 11055.0 | 0.628584 | 0.777777 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| port | 11055.0 | 0.728268 | 0.685324 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| HTTPS_token | 11055.0 | 0.675079 | 0.737779 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Request_URL | 11055.0 | 0.186793 | 0.982444 | -1.0 | -1.0 | 1.0 | 1.0 | 1.0 |
| URL_of_Anchor | 11055.0 | -0.076526 | 0.715138 | -1.0 | -1.0 | 0.0 | 0.0 | 1.0 |
| Links_in_tags | 11055.0 | -0.118137 | 0.763973 | -1.0 | -1.0 | 0.0 | 0.0 | 1.0 |
| SFH | 11055.0 | -0.595749 | 0.759143 | -1.0 | -1.0 | -1.0 | -1.0 | 1.0 |
| Submitting_to_email | 11055.0 | 0.635640 | 0.772021 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Abnormal_URL | 11055.0 | 0.705292 | 0.708949 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Redirect | 11055.0 | 0.115694 | 0.319872 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| on_mouseover | 11055.0 | 0.762099 | 0.647490 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| RightClick | 11055.0 | 0.913885 | 0.405991 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| popUpWidnow | 11055.0 | 0.613388 | 0.789818 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Iframe | 11055.0 | 0.816915 | 0.576784 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| age_of_domain | 11055.0 | 0.061239 | 0.998168 | -1.0 | -1.0 | 1.0 | 1.0 | 1.0 |
| DNSRecord | 11055.0 | 0.377114 | 0.926209 | -1.0 | -1.0 | 1.0 | 1.0 | 1.0 |
| web_traffic | 11055.0 | 0.287291 | 0.827733 | -1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| Page_Rank | 11055.0 | -0.483673 | 0.875289 | -1.0 | -1.0 | -1.0 | 1.0 | 1.0 |
| Google_Index | 11055.0 | 0.721574 | 0.692369 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Links_pointing_to_page | 11055.0 | 0.344007 | 0.569944 | -1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Statistical_report | 11055.0 | 0.719584 | 0.694437 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Result | 11055.0 | 0.113885 | 0.993539 | -1.0 | -1.0 | 1.0 | 1.0 | 1.0 |
X = df[df.columns[:-1]].values # input matrix
y = ((df['Result'] + 1) / 2).values # target vector
X.shape, y.shape, X.sum(), y.mean()
((11055, 30), (11055,), 100854, 0.5569425599276345)
Exercise 2: Implement a multilayer perceptron classifier from scratch! Use stochastic gradient descent for training. Evaluate the model on the Phishing Websites data set using a 70%-30% train-test split!
import numpy as np
def sigmoid(t):
return 1 / (1 + np.exp(-t))
class SimpleMLPClassifier:
def __init__(self, n_hidden=32, init_range=0.1, n_epochs=5, learning_rate=0.01, random_state=42):
self.n_hidden = n_hidden
self.init_range = init_range
self.n_epochs = n_epochs
self.learning_rate = learning_rate
self.random_state = random_state
def _forward(self, x_i):
h_i = sigmoid(self.W.T @ x_i) # hidden activations
yhat_i = sigmoid(self.v @ h_i) # output activation
return h_i, yhat_i
def fit(self, X, y):
# model initialization
n, d = X.shape
rs = np.random.RandomState(self.random_state)
self.W = rs.uniform(-self.init_range, self.init_range, (d, self.n_hidden)) # hidden weights
self.v = rs.uniform(-self.init_range, self.init_range, self.n_hidden) # output weights
for e in range(self.n_epochs):
for i in range(n):
h_i, yhat_i = self._forward(X[i]) # propagate the signal forward
grad_v = (yhat_i - y[i]) * h_i # derivative w.r.t. v
eps_i = (yhat_i - y[i]) * self.v * h_i * (1 - h_i) # backpropagated error
grad_W = np.outer(X[i], eps_i) # derivative w.r.t. W
# update model parameters
self.v -= self.learning_rate * grad_v
self.W -= self.learning_rate * grad_W
return self
def predict(self, X):
return (self.predict_proba(X)[:, 1] > 0.5).astype('int')
def predict_proba(self, X):
n = X.shape[0]
Yhat = np.zeros((n, 2))
for i in range(n):
_, yhat_i = self._forward(X[i])
Yhat[i, 1] = yhat_i
Yhat[:, 0] = 1 - Yhat[:, 1]
return Yhat
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import accuracy_score
tr, te = next(ShuffleSplit(test_size=0.3, random_state=42).split(X))
for n_epochs in range(6):
cl = SimpleMLPClassifier(n_epochs=n_epochs)
cl.fit(X[tr], y[tr])
print(n_epochs, accuracy_score(cl.predict(X[te]), y[te]))
0 0.43050949653301174
1 0.9201085318058486
2 0.9182996683750377
3 0.9173952366596322
4 0.9179981911365692
5 0.9176967138981007
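The hand-derived gradient formulas in `fit` can be validated with a finite-difference check. The sketch below (illustrative, not part of the exercise) recomputes the same `grad_v`, `eps` and `grad_W` expressions for a single random example and compares them with numerical derivatives of the binary cross-entropy loss, which is the loss those formulas correspond to; all shapes and values are arbitrary choices.

```python
# Finite-difference check of the backpropagation formulas used above.
# Shapes and random values are illustrative.
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def loss(W, v, x, y):
    # binary cross-entropy for a single example
    h = sigmoid(W.T @ x)
    yhat = sigmoid(v @ h)
    return -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

rs = np.random.RandomState(0)
d, n_hidden = 5, 4
x, y = rs.randn(d), 1.0
W = rs.uniform(-0.1, 0.1, (d, n_hidden))
v = rs.uniform(-0.1, 0.1, n_hidden)

# analytic gradients (same formulas as in SimpleMLPClassifier.fit)
h = sigmoid(W.T @ x)
yhat = sigmoid(v @ h)
grad_v = (yhat - y) * h
eps = (yhat - y) * v * h * (1 - h)
grad_W = np.outer(x, eps)

# numerical gradients via forward differences
delta = 1e-6
num_v = np.zeros_like(v)
for j in range(n_hidden):
    v2 = v.copy()
    v2[j] += delta
    num_v[j] = (loss(W, v2, x, y) - loss(W, v, x, y)) / delta
W2 = W.copy()
W2[0, 0] += delta
num_w00 = (loss(W2, v, x, y) - loss(W, v, x, y)) / delta

err = max(np.max(np.abs(grad_v - num_v)), abs(grad_W[0, 0] - num_w00))
print('max gradient error:', err)  # should be tiny
```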
Exercise 3: Compare the previous solution against scikit-learn's MLPClassifier!
from sklearn.neural_network import MLPClassifier
for n_epochs in range(1, 4):
cl = MLPClassifier(
hidden_layer_sizes=(32,), activation='logistic', solver='sgd', batch_size=1,
learning_rate_init=0.01, max_iter=n_epochs, momentum=0, alpha=0, random_state=42
) # (the intercept terms cannot be switched off)
cl.fit(X[tr], y[tr])
print(n_epochs, accuracy_score(cl.predict(X[te]), y[te]))
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (1) reached and the optimization hasn't converged yet. warnings.warn(
1 0.9186011456135061
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (2) reached and the optimization hasn't converged yet. warnings.warn(
2 0.9198070545673802
3 0.9207114862827857
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (3) reached and the optimization hasn't converged yet. warnings.warn(
Exercise 4: Optimize the meta-parameters of the neural network!
def evaluate(cl, X, y):
tr, te = next(ShuffleSplit(test_size=0.3, random_state=42).split(X))
cl.fit(X[tr], y[tr])
return accuracy_score(cl.predict(X[te]), y[te])
# default settings
evaluate(MLPClassifier(random_state=42), X, y)
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet. warnings.warn(
0.9644256858607175
# some hand optimization
cl = MLPClassifier(learning_rate_init=0.01, max_iter=50, batch_size=64, random_state=42)
evaluate(cl, X, y)
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet. warnings.warn(
0.9692493216762135
from sklearn.model_selection import GridSearchCV
cl = MLPClassifier()
param_grid = {
'learning_rate_init': [0.01, 0.02, 0.05],
'max_iter': [50, 75],
'batch_size': [48, 64, 72],
'random_state': [42],
}
cv = ShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
gs = GridSearchCV(cl, param_grid, cv=cv, verbose=2)
gs.fit(X, y)
Fitting 1 folds for each of 18 candidates, totalling 18 fits
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet. warnings.warn(
[CV] END batch_size=48, learning_rate_init=0.01, max_iter=50, random_state=42; total time= 14.3s
[CV] END batch_size=48, learning_rate_init=0.01, max_iter=75, random_state=42; total time= 19.6s
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet. warnings.warn(
[CV] END batch_size=48, learning_rate_init=0.02, max_iter=50, random_state=42; total time= 12.0s
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (75) reached and the optimization hasn't converged yet. warnings.warn(
[CV] END batch_size=48, learning_rate_init=0.02, max_iter=75, random_state=42; total time= 15.7s
[CV] END batch_size=48, learning_rate_init=0.05, max_iter=50, random_state=42; total time= 7.7s
[CV] END batch_size=48, learning_rate_init=0.05, max_iter=75, random_state=42; total time= 9.3s
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet. warnings.warn(
[CV] END batch_size=64, learning_rate_init=0.01, max_iter=50, random_state=42; total time= 14.2s
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (75) reached and the optimization hasn't converged yet. warnings.warn(
[CV] END batch_size=64, learning_rate_init=0.01, max_iter=75, random_state=42; total time= 17.8s
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet. warnings.warn(
[CV] END batch_size=64, learning_rate_init=0.02, max_iter=50, random_state=42; total time= 6.9s
[CV] END batch_size=64, learning_rate_init=0.02, max_iter=75, random_state=42; total time= 10.4s
[CV] END batch_size=64, learning_rate_init=0.05, max_iter=50, random_state=42; total time= 7.1s
[CV] END batch_size=64, learning_rate_init=0.05, max_iter=75, random_state=42; total time= 7.2s
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet. warnings.warn(
[CV] END batch_size=72, learning_rate_init=0.01, max_iter=50, random_state=42; total time= 6.0s
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (75) reached and the optimization hasn't converged yet. warnings.warn(
[CV] END batch_size=72, learning_rate_init=0.01, max_iter=75, random_state=42; total time= 9.7s
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet. warnings.warn(
[CV] END batch_size=72, learning_rate_init=0.02, max_iter=50, random_state=42; total time= 6.4s
[CV] END batch_size=72, learning_rate_init=0.02, max_iter=75, random_state=42; total time= 7.2s
[CV] END batch_size=72, learning_rate_init=0.05, max_iter=50, random_state=42; total time= 7.3s
[CV] END batch_size=72, learning_rate_init=0.05, max_iter=75, random_state=42; total time= 7.4s
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet. warnings.warn(
GridSearchCV(cv=ShuffleSplit(n_splits=1, random_state=42, test_size=0.3, train_size=None),
             estimator=MLPClassifier(),
             param_grid={'batch_size': [48, 64, 72],
                         'learning_rate_init': [0.01, 0.02, 0.05],
                         'max_iter': [50, 75], 'random_state': [42]},
             verbose=2)
df_res = pd.DataFrame(gs.cv_results_)
columns = ['param_batch_size', 'param_learning_rate_init', 'param_max_iter', 'split0_test_score']
df_res.sort_values(columns[-1])[::-1][columns]
| param_batch_size | param_learning_rate_init | param_max_iter | split0_test_score | |
|---|---|---|---|---|
| 6 | 64 | 0.01 | 50 | 0.969249 |
| 7 | 64 | 0.01 | 75 | 0.965632 |
| 12 | 72 | 0.01 | 50 | 0.965330 |
| 0 | 48 | 0.01 | 50 | 0.965029 |
| 15 | 72 | 0.02 | 75 | 0.963521 |
| 13 | 72 | 0.01 | 75 | 0.963521 |
| 3 | 48 | 0.02 | 75 | 0.962918 |
| 2 | 48 | 0.02 | 50 | 0.961411 |
| 1 | 48 | 0.01 | 75 | 0.960808 |
| 8 | 64 | 0.02 | 50 | 0.958396 |
| 9 | 64 | 0.02 | 75 | 0.957190 |
| 10 | 64 | 0.05 | 50 | 0.952970 |
| 11 | 64 | 0.05 | 75 | 0.952970 |
| 14 | 72 | 0.02 | 50 | 0.952367 |
| 16 | 72 | 0.05 | 50 | 0.949955 |
| 17 | 72 | 0.05 | 75 | 0.949955 |
| 5 | 48 | 0.05 | 75 | 0.942116 |
| 4 | 48 | 0.05 | 50 | 0.942116 |
# Bernold's solution
cl = MLPClassifier(learning_rate_init=0.01, hidden_layer_sizes=(100, 100, 30), random_state=42)
evaluate(cl, X, y)
0.9701537533916189
cl = MLPClassifier(
learning_rate_init=0.01, hidden_layer_sizes=(100, 100, 30),
random_state=42, solver='sgd'
)
evaluate(cl, X, y)
0.963521254145312
cl = MLPClassifier(
learning_rate_init=0.01, hidden_layer_sizes=(100, 100, 30),
random_state=42, solver='sgd', learning_rate='adaptive',
)
evaluate(cl, X, y)
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet. warnings.warn(
0.9656315948145915
cl = MLPClassifier(
learning_rate_init=0.02, hidden_layer_sizes=(100, 100, 30),
random_state=42, solver='sgd', learning_rate='adaptive',
)
evaluate(cl, X, y)
0.9674404582454025
cl = MLPClassifier(
learning_rate_init=0.04, hidden_layer_sizes=(100, 100, 30),
random_state=42, solver='sgd', learning_rate='adaptive',
)
evaluate(cl, X, y)
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet. warnings.warn(
0.9686463671992764
cl = MLPClassifier(
learning_rate_init=0.08, hidden_layer_sizes=(100, 100, 30),
random_state=42, solver='sgd', learning_rate='adaptive',
)
evaluate(cl, X, y)
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet. warnings.warn(
0.966536026529997
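Instead of capping `max_iter` by hand, `MLPClassifier` can also stop training automatically through its `early_stopping` option, which holds out part of the training data as a validation set and stops when its score no longer improves. A minimal sketch on synthetic data (the data set and all parameter values below are illustrative, not taken from the exercise):

```python
# Sketch: built-in early stopping in MLPClassifier.
# make_classification stands in for the Phishing Websites data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

Xs, ys = make_classification(n_samples=2000, n_features=30, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(Xs, ys, test_size=0.3, random_state=42)

cl = MLPClassifier(
    hidden_layer_sizes=(100,), learning_rate_init=0.01,
    early_stopping=True,      # hold out part of the training data...
    validation_fraction=0.1,  # ...and stop when its score stops improving
    n_iter_no_change=10, max_iter=200, random_state=42
)
cl.fit(Xtr, ytr)
acc = cl.score(Xte, yte)
print('epochs run:', cl.n_iter_, ' test accuracy:', round(acc, 3))
```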
Exercise 1: Visualize the Wisconsin Breast Cancer data set using PCA!
# Load the data to DataFrame.
import pandas as pd
names = [
'Sample_code_number', 'Clump_Thickness', 'Uniformity_of_Cell_Size',
'Uniformity_of_Cell_Shape', 'Marginal_Adhesion', 'Single_Epithelial_Cell_Size',
'Bare_Nuclei', 'Bland_Chromatin', 'Normal_Nucleoli', 'Mitoses', 'Class'
]
df = pd.read_csv('../_data/wisconsin_data.txt', sep=',', names=names, na_values='?')
df['Bare_Nuclei'].fillna(0, inplace=True)
df.head()
| Sample_code_number | Clump_Thickness | Uniformity_of_Cell_Size | Uniformity_of_Cell_Shape | Marginal_Adhesion | Single_Epithelial_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10.0 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2.0 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4.0 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1.0 | 3 | 1 | 1 | 2 |
X = df[df.columns[1:-1]].values # input matrix
y = (df['Class'] // 2 - 1).values # target vector
X.shape
(699, 9)
# Scikit-Learn based implementation.
from sklearn.decomposition import PCA
pca = PCA(2)
pca.fit(X) # computes the eigenvalue decomposition of the centered X^T * X
Z = pca.transform(X) # computes X * W_K
# fit(), then transform()
Z = pca.fit_transform(X)
Z
array([[-4.40904732, 0.01994907],
[ 4.88110465, -4.90332617],
[-4.56415067, -0.65506726],
...,
[10.33687644, 7.24412961],
[ 6.47179821, 2.54870704],
[ 7.56560301, 1.23300626]])
# NumPy based implementation.
import numpy as np
X2 = X - X.mean(axis=0) # subtract column means
L, W = np.linalg.eig(X2.T @ X2) # eigenvalue decomposition
X2 @ W[:, :2]
array([[ 4.40904732, -0.01994907],
[ -4.88110465, 4.90332617],
[ 4.56415067, 0.65506726],
...,
[-10.33687644, -7.24412961],
[ -6.47179821, -2.54870704],
[ -7.56560301, -1.23300626]])
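The scikit-learn and NumPy projections above differ only in sign: eigenvectors are determined up to a factor of ±1, so whole columns of the projection may be flipped. A self-contained check on random data (illustrative; `np.linalg.eigh` is used instead of `eig` since the matrix is symmetric and `eigh` returns eigenvalues in sorted order):

```python
# Sketch: PCA projections from sklearn and from an eigendecomposition
# agree up to per-column sign flips. The random data is illustrative.
import numpy as np
from sklearn.decomposition import PCA

rs = np.random.RandomState(42)
A = rs.randn(200, 5) @ rs.randn(5, 5)  # random data with correlated columns

Z_sk = PCA(2).fit_transform(A)

A2 = A - A.mean(axis=0)           # subtract column means
L, W = np.linalg.eigh(A2.T @ A2)  # eigh sorts eigenvalues ascending
Z_np = A2 @ W[:, ::-1][:, :2]     # take the two largest components

match = np.allclose(np.abs(Z_sk), np.abs(Z_np))
print(match)
```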
# Visualization.
import matplotlib.pyplot as plt
def visualize(Z, y):
cond = (y == 0)
plt.figure(figsize=(8, 8))
plt.scatter(Z[cond, 0], Z[cond, 1], alpha=0.6)
plt.scatter(Z[~cond, 0], Z[~cond, 1], alpha=0.6)
plt.legend(['benign', 'malignant'])
visualize(Z, y)
Exercise 2: Measure the cross-validation accuracy of gradient boosting after reducing the dimension to 1, 2, ..., 9 using PCA.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
def evaluate(cl, X, y):
cv = KFold(5, shuffle=True, random_state=42)
scores = []
for tr, te in cv.split(X):
cl.fit(X[tr], y[tr])
yhat = cl.predict(X)
scores.append(accuracy_score(y[te], yhat[te]))
return np.mean(scores)
evaluate(GradientBoostingClassifier(), X, y)
0.9556423432682426
evaluate(GradientBoostingClassifier(), X - X.mean(axis=0), y)
0.9542137718396712
for n_components in range(1, X.shape[1] + 1):
X2 = PCA(n_components).fit_transform(X)
acc = evaluate(GradientBoostingClassifier(), X2, y)
print(n_components, acc)
1 0.9627852004110998
2 0.9570503597122302
3 0.9613360739979445
4 0.9656526207605344
5 0.9642240493319629
6 0.9642240493319629
7 0.9642240493319629
8 0.9599280575539568
9 0.9627852004110997
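Another way to pick the number of components is to inspect `explained_variance_ratio_`, the fraction of total variance each component captures. A sketch on synthetic low-rank data (the random matrix below merely stands in for X):

```python
# Sketch: cumulative explained variance as a guide for choosing n_components.
import numpy as np
from sklearn.decomposition import PCA

rs = np.random.RandomState(42)
A = rs.randn(500, 3) @ rs.randn(3, 9)  # rank-3 data embedded in 9 dimensions

pca = PCA(9).fit(A)
cum = np.cumsum(pca.explained_variance_ratio_)
print(np.round(cum, 4))  # the first 3 components explain (nearly) all variance
```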
Exercise 3: Prepare the t-SNE visualization of the Wisconsin Breast Cancer data set.
from sklearn.manifold import TSNE
tsne = TSNE(2)
Z = tsne.fit_transform(X)
visualize(Z, y)
# Try different perplexity values!
visualize(TSNE(2, perplexity=60).fit_transform(X), y)
visualize(TSNE(2, perplexity=15).fit_transform(X), y)
visualize(TSNE(2, perplexity=7.5).fit_transform(X), y)
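One rough way to compare runs at different perplexity values is the final Kullback-Leibler divergence that t-SNE minimizes, exposed as `kl_divergence_` (values obtained at different perplexities are not strictly comparable, so this is only a diagnostic). A sketch on a small synthetic set for speed (the two-cluster data is illustrative):

```python
# Sketch: inspecting the final KL divergence of t-SNE runs.
# Two well-separated Gaussian clusters stand in for real data.
import numpy as np
from sklearn.manifold import TSNE

rs = np.random.RandomState(42)
A = np.vstack([rs.randn(100, 5), rs.randn(100, 5) + 5])

for perp in [5, 30, 60]:
    tsne = TSNE(2, perplexity=perp, init='random', random_state=42)
    Z = tsne.fit_transform(A)
    print(perp, round(float(tsne.kl_divergence_), 3))
```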
# Exercise: Create a t-SNE visualization of the Bike Sharing data set!
fname = '../_data/bike_sharing_data.txt'
df = pd.read_csv(fname)
df
| instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0000 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0000 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0000 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0000 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0000 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17374 | 17375 | 2012-12-31 | 1 | 1 | 12 | 19 | 0 | 1 | 1 | 2 | 0.26 | 0.2576 | 0.60 | 0.1642 | 11 | 108 | 119 |
| 17375 | 17376 | 2012-12-31 | 1 | 1 | 12 | 20 | 0 | 1 | 1 | 2 | 0.26 | 0.2576 | 0.60 | 0.1642 | 8 | 81 | 89 |
| 17376 | 17377 | 2012-12-31 | 1 | 1 | 12 | 21 | 0 | 1 | 1 | 1 | 0.26 | 0.2576 | 0.60 | 0.1642 | 7 | 83 | 90 |
| 17377 | 17378 | 2012-12-31 | 1 | 1 | 12 | 22 | 0 | 1 | 1 | 1 | 0.26 | 0.2727 | 0.56 | 0.1343 | 13 | 48 | 61 |
| 17378 | 17379 | 2012-12-31 | 1 | 1 | 12 | 23 | 0 | 1 | 1 | 1 | 0.26 | 0.2727 | 0.65 | 0.1343 | 12 | 37 | 49 |
17379 rows × 17 columns
# Extract input matrix.
columns = [
'season', 'yr', 'mnth', 'hr', 'holiday',
'weekday', 'workingday', 'weathersit', 'temp', 'atemp',
'hum', 'windspeed'
]
df[columns].describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| season | 17379.0 | 2.501640 | 1.106918 | 1.00 | 2.0000 | 3.0000 | 3.0000 | 4.0000 |
| yr | 17379.0 | 0.502561 | 0.500008 | 0.00 | 0.0000 | 1.0000 | 1.0000 | 1.0000 |
| mnth | 17379.0 | 6.537775 | 3.438776 | 1.00 | 4.0000 | 7.0000 | 10.0000 | 12.0000 |
| hr | 17379.0 | 11.546752 | 6.914405 | 0.00 | 6.0000 | 12.0000 | 18.0000 | 23.0000 |
| holiday | 17379.0 | 0.028770 | 0.167165 | 0.00 | 0.0000 | 0.0000 | 0.0000 | 1.0000 |
| weekday | 17379.0 | 3.003683 | 2.005771 | 0.00 | 1.0000 | 3.0000 | 5.0000 | 6.0000 |
| workingday | 17379.0 | 0.682721 | 0.465431 | 0.00 | 0.0000 | 1.0000 | 1.0000 | 1.0000 |
| weathersit | 17379.0 | 1.425283 | 0.639357 | 1.00 | 1.0000 | 1.0000 | 2.0000 | 4.0000 |
| temp | 17379.0 | 0.496987 | 0.192556 | 0.02 | 0.3400 | 0.5000 | 0.6600 | 1.0000 |
| atemp | 17379.0 | 0.475775 | 0.171850 | 0.00 | 0.3333 | 0.4848 | 0.6212 | 1.0000 |
| hum | 17379.0 | 0.627229 | 0.192930 | 0.00 | 0.4800 | 0.6300 | 0.7800 | 1.0000 |
| windspeed | 17379.0 | 0.190098 | 0.122340 | 0.00 | 0.1045 | 0.1940 | 0.2537 | 0.8507 |
X = df[columns].values
X /= X.std(axis=0)
X.shape
(17379, 12)
def visualize2(Z, y):
plt.figure(figsize=(8, 8))
plt.scatter(Z[:, 0], Z[:, 1], c=y, alpha=0.6)
plt.colorbar()
n = 2000
Z = TSNE(2).fit_transform(X[:n])
visualize2(Z, df['cnt'].values[:n])
The confusion matrix of a binary classifier summarizes the four possible outcomes of a prediction:

| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | TN | FP |
| Actual Positive | FN | TP |
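This layout (rows = actual class, columns = predicted class) matches what scikit-learn's `confusion_matrix` returns; a quick illustrative check:

```python
# Sketch: confusion_matrix follows the TN/FP/FN/TP layout above.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # -> 2 1 1 2
```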
Exercise 1: The file ctg_data.txt contains data about cardiotocographic examinations. See ctg_names.txt for more details. Build a classifier that estimates if the value of the NSP column is 1 (meaning "normal") or not! Use 70%-30% train-test split for evaluation! Measure the accuracy, balanced accuracy, precision, recall, and F1-score of the trained model!
# Load data.
import pandas as pd
df = pd.read_csv('../_data/ctg_data.txt')
df = df.sample(len(df), random_state=42) # permute rows
df
| LB | AC | FM | UC | DL | DS | DP | ASTV | MSTV | ALTV | ... | Min | Max | Nmax | Nzeros | Mode | Mean | Median | Variance | Tendency | NSP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 282 | 133 | 0.002 | 0.010 | 0.003 | 0.002 | 0.0 | 0.000 | 46 | 1.1 | 0 | ... | 95 | 164 | 5 | 0 | 139 | 135 | 138 | 9 | 0 | 1 |
| 1999 | 125 | 0.000 | 0.001 | 0.009 | 0.008 | 0.0 | 0.000 | 62 | 1.7 | 0 | ... | 68 | 140 | 5 | 0 | 130 | 116 | 125 | 29 | 1 | 1 |
| 1709 | 131 | 0.004 | 0.003 | 0.004 | 0.005 | 0.0 | 0.001 | 60 | 2.1 | 0 | ... | 78 | 168 | 8 | 0 | 133 | 127 | 132 | 21 | 0 | 1 |
| 988 | 131 | 0.011 | 0.000 | 0.005 | 0.000 | 0.0 | 0.000 | 29 | 1.3 | 0 | ... | 82 | 171 | 8 | 0 | 143 | 145 | 145 | 9 | 1 | 1 |
| 2018 | 125 | 0.000 | 0.000 | 0.008 | 0.007 | 0.0 | 0.001 | 64 | 1.3 | 0 | ... | 78 | 155 | 4 | 0 | 114 | 111 | 114 | 7 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1638 | 130 | 0.009 | 0.001 | 0.004 | 0.001 | 0.0 | 0.000 | 52 | 1.3 | 0 | ... | 73 | 172 | 6 | 0 | 144 | 141 | 144 | 16 | 1 | 1 |
| 1095 | 123 | 0.012 | 0.000 | 0.002 | 0.000 | 0.0 | 0.000 | 22 | 2.2 | 0 | ... | 100 | 152 | 2 | 0 | 131 | 132 | 133 | 4 | 0 | 1 |
| 1130 | 122 | 0.005 | 0.000 | 0.004 | 0.005 | 0.0 | 0.000 | 20 | 2.6 | 0 | ... | 60 | 158 | 6 | 0 | 131 | 121 | 126 | 31 | 0 | 1 |
| 1294 | 115 | 0.003 | 0.000 | 0.008 | 0.002 | 0.0 | 0.001 | 24 | 1.6 | 0 | ... | 71 | 179 | 3 | 2 | 133 | 122 | 129 | 45 | 0 | 1 |
| 860 | 142 | 0.001 | 0.000 | 0.004 | 0.000 | 0.0 | 0.000 | 39 | 0.9 | 0 | ... | 127 | 159 | 1 | 0 | 151 | 147 | 149 | 4 | 1 | 1 |
2126 rows × 22 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2126 entries, 282 to 860
Data columns (total 22 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   LB        2126 non-null   int64
 1   AC        2126 non-null   float64
 2   FM        2126 non-null   float64
 3   UC        2126 non-null   float64
 4   DL        2126 non-null   float64
 5   DS        2126 non-null   float64
 6   DP        2126 non-null   float64
 7   ASTV      2126 non-null   int64
 8   MSTV      2126 non-null   float64
 9   ALTV      2126 non-null   int64
 10  MLTV      2126 non-null   float64
 11  Width     2126 non-null   int64
 12  Min       2126 non-null   int64
 13  Max       2126 non-null   int64
 14  Nmax      2126 non-null   int64
 15  Nzeros    2126 non-null   int64
 16  Mode      2126 non-null   int64
 17  Mean      2126 non-null   int64
 18  Median    2126 non-null   int64
 19  Variance  2126 non-null   int64
 20  Tendency  2126 non-null   int64
 21  NSP       2126 non-null   int64
dtypes: float64(8), int64(14)
memory usage: 382.0 KB
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| LB | 2126.0 | 133.303857 | 9.840844 | 106.0 | 126.000 | 133.000 | 140.000 | 160.000 |
| AC | 2126.0 | 0.003178 | 0.003866 | 0.0 | 0.000 | 0.002 | 0.006 | 0.019 |
| FM | 2126.0 | 0.009481 | 0.046666 | 0.0 | 0.000 | 0.000 | 0.003 | 0.481 |
| UC | 2126.0 | 0.004366 | 0.002946 | 0.0 | 0.002 | 0.004 | 0.007 | 0.015 |
| DL | 2126.0 | 0.001889 | 0.002960 | 0.0 | 0.000 | 0.000 | 0.003 | 0.015 |
| DS | 2126.0 | 0.000003 | 0.000057 | 0.0 | 0.000 | 0.000 | 0.000 | 0.001 |
| DP | 2126.0 | 0.000159 | 0.000590 | 0.0 | 0.000 | 0.000 | 0.000 | 0.005 |
| ASTV | 2126.0 | 46.990122 | 17.192814 | 12.0 | 32.000 | 49.000 | 61.000 | 87.000 |
| MSTV | 2126.0 | 1.332785 | 0.883241 | 0.2 | 0.700 | 1.200 | 1.700 | 7.000 |
| ALTV | 2126.0 | 9.846660 | 18.396880 | 0.0 | 0.000 | 0.000 | 11.000 | 91.000 |
| MLTV | 2126.0 | 8.187629 | 5.628247 | 0.0 | 4.600 | 7.400 | 10.800 | 50.700 |
| Width | 2126.0 | 70.445908 | 38.955693 | 3.0 | 37.000 | 67.500 | 100.000 | 180.000 |
| Min | 2126.0 | 93.579492 | 29.560212 | 50.0 | 67.000 | 93.000 | 120.000 | 159.000 |
| Max | 2126.0 | 164.025400 | 17.944183 | 122.0 | 152.000 | 162.000 | 174.000 | 238.000 |
| Nmax | 2126.0 | 4.068203 | 2.949386 | 0.0 | 2.000 | 3.000 | 6.000 | 18.000 |
| Nzeros | 2126.0 | 0.323612 | 0.706059 | 0.0 | 0.000 | 0.000 | 0.000 | 10.000 |
| Mode | 2126.0 | 137.452023 | 16.381289 | 60.0 | 129.000 | 139.000 | 148.000 | 187.000 |
| Mean | 2126.0 | 134.610536 | 15.593596 | 73.0 | 125.000 | 136.000 | 145.000 | 182.000 |
| Median | 2126.0 | 138.090310 | 14.466589 | 77.0 | 129.000 | 139.000 | 148.000 | 186.000 |
| Variance | 2126.0 | 18.808090 | 28.977636 | 0.0 | 2.000 | 7.000 | 24.000 | 269.000 |
| Tendency | 2126.0 | 0.320320 | 0.610829 | -1.0 | 0.000 | 0.000 | 1.000 | 1.000 |
| NSP | 2126.0 | 1.304327 | 0.614377 | 1.0 | 1.000 | 1.000 | 1.000 | 3.000 |
X = df[df.columns[:-1]].values # input matrix
y = (df['NSP'] > 1).astype('int') # target vector (NSP:1 => 0, NSP:{2,3} => 1)
# standardize input matrix
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)
X.shape, y.shape, y.mean()
((2126, 21), (2126,), 0.2215428033866416)
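`StandardScaler` rescales each column to zero mean and unit standard deviation, which puts all 21 features on a comparable scale. A minimal sketch on a made-up matrix (variable names are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix (made up): two features on very different scales.
A = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
A_std = StandardScaler().fit_transform(A)

print(A_std.mean(axis=0))  # each column now has mean ~0
print(A_std.std(axis=0))   # ...and (population) standard deviation ~1
```

`fit_transform` combines `fit` (estimating column means and standard deviations) with `transform` (applying them) in one call.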
# generate train-test split
# from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import StratifiedShuffleSplit
def gen_split(X, y):
    return next(StratifiedShuffleSplit(test_size=0.3, random_state=42).split(X, y))
tr, te = gen_split(X, y)
tr.shape, te.shape
((1488,), (638,))
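Unlike a plain `ShuffleSplit`, the stratified variant keeps the class ratio (approximately) the same in the train and test parts, which matters for an imbalanced target like ours. A small sketch on made-up labels:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up imbalanced target: 80 negatives, 20 positives.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((100, 1))

tr_idx, te_idx = next(StratifiedShuffleSplit(test_size=0.3, random_state=0).split(X_toy, y_toy))

# Both parts keep the 20% positive rate of the full data set.
print(y_toy[tr_idx].mean(), y_toy[te_idx].mean())
```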
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, f1_score
def evaluate(cl, X, y):
    tr, te = gen_split(X, y)
    cl.fit(X[tr], y[tr])
    yhat = cl.predict(X)
    return {
        'acc': accuracy_score(y[te], yhat[te]),
        'bal_acc': balanced_accuracy_score(y[te], yhat[te]),
        'prec': precision_score(y[te], yhat[te]),
        'rec': recall_score(y[te], yhat[te]),
        'f1': f1_score(y[te], yhat[te])
    }
evaluate(DummyClassifier(), X, y)
/home/compsci/.local/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior. _warn_prf(average, modifier, msg_start, len(result))
{'acc': 0.7789968652037618, 'bal_acc': 0.5, 'prec': 0.0, 'rec': 0.0, 'f1': 0.0}
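The dummy baseline (default strategy `'prior'` in recent scikit-learn versions) always predicts the majority class, so its accuracy is simply the majority-class share (here roughly 1 − 0.2215 ≈ 0.779), while its recall and precision on the positive class are 0 (hence the warning). A toy illustration of this, on made-up labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up imbalanced target: 78% negatives, 22% positives.
y_toy = np.array([0] * 78 + [1] * 22)
X_toy = np.zeros((100, 1))

cl_dummy = DummyClassifier()  # default strategy predicts the most frequent class
cl_dummy.fit(X_toy, y_toy)
print(cl_dummy.score(X_toy, y_toy))  # accuracy = majority-class share = 0.78
```

This is why accuracy alone is a misleading metric on imbalanced data, and why balanced accuracy, precision, recall and F1 are reported alongside it.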
evaluate(LogisticRegression(C=1, class_weight='balanced'), X, y)
{'acc': 0.5235109717868338,
'bal_acc': 0.47825962869415073,
'prec': 0.20363636363636364,
'rec': 0.3971631205673759,
'f1': 0.2692307692307693}
evaluate(GradientBoostingClassifier(max_depth=4, random_state=42), X, y)
{'acc': 0.7586206896551724,
'bal_acc': 0.5097820968363372,
'prec': 0.2903225806451613,
'rec': 0.06382978723404255,
'f1': 0.10465116279069768}
Exercise 2: Prepare the confusion matrix of the trained model! Compute accuracy, balanced accuracy, precision, recall and F1-score from the elements of the confusion matrix!
from sklearn.metrics import confusion_matrix
cl = GradientBoostingClassifier(max_depth=4, random_state=42)
cl.fit(X[tr], y[tr])
yhat = cl.predict(X)
(tn, fp), (fn, tp) = confusion_matrix(y[te], yhat[te])  # rows: actual class, columns: predicted class
p = tp + fn # total number of actual positives
n = tn + fp # total number of actual negatives
print('accuracy:', (tp + tn) / (p + n))
print('balanced accuracy:', ((tp / p) + (tn / n)) / 2)
prec = tp / (tp + fp)
print('precision:', prec)
rec = tp / p
print('recall:', rec)
print('f1:', 2 / (1 / prec + 1 / rec))
accuracy: 0.7586206896551724 balanced accuracy: 0.5097820968363372 precision: 0.2903225806451613 recall: 0.06382978723404255 f1: 0.10465116279069768
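The same hand computation can be sanity-checked against scikit-learn's metric functions on any small example; the labels below are made up:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Tiny made-up labels to check the formulas above against sklearn.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 0])

(tn, fp), (fn, tp) = confusion_matrix(y_true, y_pred)  # tn=2, fp=2, fn=1, tp=3
prec = tp / (tp + fp)          # 3/5 = 0.6
rec = tp / (tp + fn)           # 3/4 = 0.75
f1 = 2 / (1 / prec + 1 / rec)  # harmonic mean of precision and recall

print(prec, precision_score(y_true, y_pred))
print(rec, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```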
Exercise 3: Optimize the decision threshold of the trained model in terms of the test F1-score!
import numpy as np
p = cl.predict_proba(X)[:, 1]
data = []
for th in np.arange(0, 1.05, 0.05):
    yhat = (p > th)  # probability prediction => label prediction
    data.append({
        'th': th,
        'prec': precision_score(y[te], yhat[te], zero_division=0),
        'rec': recall_score(y[te], yhat[te]),
        'f1': f1_score(y[te], yhat[te])
    })
df2 = pd.DataFrame(data).set_index('th')
df2.plot()
<AxesSubplot: xlabel='th'>
# optimal threshold
df2['f1'].idxmax()
0.1
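As an alternative to the manual 0.05-step grid, `precision_recall_curve` evaluates every distinct predicted score as a candidate threshold. A sketch of this approach on made-up labels and scores (note its decision rule is score ≥ threshold):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and scores.
y_toy = np.array([0, 0, 1, 0, 1, 1, 0, 1])
s_toy = np.array([0.1, 0.3, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8])

pr, rc, ths = precision_recall_curve(y_toy, s_toy)
f1s = 2 * pr * rc / (pr + rc + 1e-12)  # tiny epsilon avoids 0/0 at the curve's end
best = ths[f1s[:-1].argmax()]          # the last (pr, rc) pair has no threshold
print(best)  # 0.35 for this toy example
```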
Exercise 4: Display the ROC curve corresponding to the trained model! Compute the area under the ROC curve!
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
fpr, tpr, th = roc_curve(y[te], p[te])
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], '--')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
Text(0, 0.5, 'true positive rate')
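The exercise also asks for the area under the ROC curve. Two equivalent ways to compute it, shown here on made-up labels and scores: integrating the `(fpr, tpr)` points with `auc`, or computing it directly from labels and scores with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Made-up labels and scores.
y_toy = np.array([0, 0, 1, 0, 1, 1])
s_toy = np.array([0.2, 0.3, 0.4, 0.6, 0.7, 0.9])

fpr_t, tpr_t, _ = roc_curve(y_toy, s_toy)
print(auc(fpr_t, tpr_t))            # trapezoidal area under the ROC points
print(roc_auc_score(y_toy, s_toy))  # same value, straight from labels and scores
```

On our test set this would be `roc_auc_score(y[te], p[te])`.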
# Train a classifier on the Wisconsin Breast cancer data set,
# and draw its ROC curve!
# Column names.
names = [
'Sample_code_number', 'Clump_Thickness', 'Uniformity_of_Cell_Size', 'Uniformity_of_Cell_Shape',
'Marginal_Adhesion', 'Single_Epithelial_Cell_Size', 'Bare_Nuclei', 'Bland_Chromatin',
'Normal_Nucleoli', 'Mitoses', 'Class'
]
df = pd.read_csv('../_data/wisconsin_data.txt', sep=',', names=names, na_values='?')
df['Bare_Nuclei'].fillna(df['Bare_Nuclei'].mean(), inplace=True)
X = df[names[1:-1]].values # input matrix
y = df['Class'].values // 2 - 1 # target vector
tr, te = next(StratifiedShuffleSplit(test_size=0.3, random_state=42).split(X, y))
cl = GradientBoostingClassifier(random_state=42)
cl.fit(X[tr], y[tr])
p = cl.predict_proba(X)[:, 1]
fpr, tpr, th = roc_curve(y[te], p[te])
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], '--')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
Text(0, 0.5, 'true positive rate')
import pandas as pd
# Let's load the Wisconsin Breast Cancer data set!
names = [
'Sample_code_number', 'Clump_Thickness', 'Uniformity_of_Cell_Size', 'Uniformity_of_Cell_Shape',
'Marginal_Adhesion', 'Single_Epithelial_Cell_Size', 'Bare_Nuclei', 'Bland_Chromatin',
'Normal_Nucleoli', 'Mitoses', 'Class'
]
df = pd.read_csv('../_data/wisconsin_data.txt', sep=',', names=names, na_values='?')
# Substitute the null values in the Bare_Nuclei column!
df['Bare_Nuclei'].fillna(df['Bare_Nuclei'].mean(), inplace=True)
# Extract input matrix!
X = df[names[1: -1]].values
X
array([[ 5., 1., 1., ..., 3., 1., 1.],
[ 5., 4., 4., ..., 3., 2., 1.],
[ 3., 1., 1., ..., 3., 1., 1.],
...,
[ 5., 10., 10., ..., 8., 10., 2.],
[ 4., 8., 6., ..., 10., 6., 1.],
[ 4., 8., 8., ..., 10., 4., 1.]])
X.shape
(699, 9)
# column standard deviations
X.std(axis=0)
array([2.81372582, 3.0492756 , 2.96978617, 2.85333603, 2.21271541,
3.59927429, 2.43661945, 3.05144882, 1.7138507 ])
Exercise 1: Implement K-means clustering from scratch and test it on the Wisconsin Breast Cancer data set!
K = 5 # number of clusters
n_iter = 100
# randomly draw K cluster centers
import numpy as np
rs = np.random.RandomState(42)
C = X[rs.permutation(X.shape[0])[:K]]
for i in range(n_iter):
    # compute nearest center for each data point
    clusters = np.array([((C - xi)**2).sum(axis=1).argmin() for xi in X])
    # re-compute cluster centers
    for k in range(K):
        C[k] = X[clusters == k].mean(axis=0)
clusters
array([1, 4, 1, 4, 1, 3, 0, 2, 2, 1, 2, 2, 1, 2, 3, 1, 1, 1, 4, 1, 4, 3,
1, 4, 2, 4, 1, 1, 2, 2, 1, 2, 3, 2, 1, 2, 3, 1, 4, 4, 3, 4, 3, 4,
3, 2, 4, 2, 1, 4, 4, 0, 4, 3, 4, 4, 3, 1, 4, 1, 4, 2, 3, 1, 2, 1,
1, 4, 3, 2, 1, 3, 2, 4, 4, 2, 2, 1, 0, 2, 2, 1, 1, 1, 3, 3, 4, 3,
1, 2, 2, 1, 1, 2, 2, 2, 2, 1, 3, 3, 4, 0, 1, 1, 3, 1, 4, 3, 2, 4,
2, 4, 4, 3, 0, 0, 1, 3, 2, 1, 2, 1, 3, 4, 3, 2, 4, 1, 4, 2, 1, 2,
3, 1, 1, 1, 1, 1, 1, 0, 1, 2, 4, 0, 2, 0, 4, 2, 1, 3, 2, 4, 3, 1,
2, 4, 2, 2, 0, 3, 4, 1, 1, 0, 1, 1, 3, 3, 1, 2, 1, 2, 2, 3, 4, 3,
2, 3, 1, 4, 2, 2, 1, 3, 4, 2, 3, 3, 3, 2, 3, 3, 1, 2, 1, 1, 4, 1,
2, 1, 3, 4, 2, 1, 2, 3, 3, 2, 2, 1, 3, 3, 2, 3, 3, 3, 2, 2, 3, 1,
2, 3, 0, 4, 4, 2, 4, 3, 2, 3, 4, 3, 1, 4, 1, 0, 3, 3, 3, 4, 1, 1,
2, 0, 2, 1, 3, 4, 1, 0, 2, 4, 4, 3, 3, 4, 1, 1, 1, 4, 4, 3, 3, 4,
3, 1, 4, 4, 3, 2, 4, 1, 4, 1, 1, 0, 1, 2, 2, 4, 1, 2, 4, 4, 4, 3,
3, 1, 4, 3, 2, 2, 3, 4, 0, 4, 4, 1, 1, 4, 3, 2, 3, 2, 4, 4, 2, 2,
3, 0, 2, 2, 3, 2, 2, 3, 4, 3, 2, 4, 4, 0, 1, 4, 2, 1, 4, 2, 3, 4,
4, 1, 1, 4, 4, 2, 4, 2, 2, 4, 4, 2, 2, 2, 3, 2, 1, 2, 0, 4, 1, 2,
4, 3, 2, 1, 1, 3, 3, 4, 3, 4, 0, 0, 2, 2, 3, 3, 2, 2, 1, 2, 1, 1,
1, 2, 2, 2, 1, 1, 2, 4, 1, 2, 2, 1, 4, 1, 2, 1, 2, 3, 1, 2, 2, 1,
1, 1, 1, 2, 3, 1, 1, 0, 2, 2, 1, 2, 2, 1, 2, 0, 3, 1, 4, 0, 4, 2,
1, 2, 0, 3, 1, 1, 1, 3, 1, 3, 2, 2, 2, 1, 1, 1, 4, 4, 3, 1, 1, 1,
4, 0, 0, 2, 1, 2, 2, 1, 2, 3, 1, 1, 1, 3, 2, 1, 4, 3, 1, 1, 1, 0,
1, 1, 1, 3, 4, 4, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 3, 3,
1, 0, 1, 3, 4, 1, 2, 4, 1, 3, 0, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1,
3, 0, 1, 2, 2, 1, 1, 1, 3, 3, 2, 2, 1, 4, 2, 1, 4, 4, 1, 1, 1, 1,
1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 3, 2, 1, 3,
1, 2, 1, 0, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 3, 1, 1, 4, 3, 4, 3,
1, 2, 3, 1, 1, 2, 2, 2, 1, 3, 3, 1, 1, 2, 3, 1, 3, 1, 3, 4, 4, 1,
4, 1, 1, 1, 1, 1, 1, 2, 1, 3, 4, 3, 1, 2, 3, 1, 4, 3, 3, 2, 2, 1,
1, 0, 1, 1, 1, 1, 1, 2, 1, 0, 3, 0, 2, 1, 1, 1, 2, 3, 1, 1, 3, 1,
1, 1, 1, 1, 1, 2, 2, 1, 2, 2, 3, 1, 0, 2, 1, 1, 1, 1, 1, 1, 4, 2,
2, 1, 2, 2, 1, 2, 1, 1, 3, 3, 3, 1, 2, 1, 2, 1, 2, 1, 2, 2, 3, 3,
1, 2, 2, 2, 2, 1, 1, 2, 2, 3, 1, 1, 1, 2, 3, 3, 3])
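The assignment step above visits the data points in a Python loop; with NumPy broadcasting it can be done in one vectorized expression. A sketch on toy 2-D data (the data and names below are made up, and the centers are fixed rather than iterated):

```python
import numpy as np

# Two made-up blobs around (0, 0) and (5, 5), with the true centers as C_toy.
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.randn(10, 2), rng.randn(10, 2) + 5.0])
C_toy = np.array([[0.0, 0.0], [5.0, 5.0]])

# (n, 1, d) minus (K, d) broadcasts to (n, K, d); summing over d gives an (n, K)
# table of squared distances, and argmin over axis 1 picks the nearest center.
d2 = ((X_toy[:, None, :] - C_toy[None, :, :]) ** 2).sum(axis=2)
labels_toy = d2.argmin(axis=1)
print(labels_toy)  # first 10 points fall to center 0, last 10 to center 1
```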
Exercise 2: Visualize the clusters on a scatter plot of the data!
# apply t-SNE on the data matrix X
from sklearn.manifold import TSNE
Z = TSNE(random_state=42).fit_transform(X)
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 8))
for k in range(K):
    plt.scatter(Z[clusters == k, 0], Z[clusters == k, 1], alpha=0.7)
Exercise 3: Repeat the previous experiment using scikit-learn's built-in KMeans class!
from sklearn.cluster import KMeans
km = KMeans(n_clusters=5, init='random', n_init=1, random_state=42)
clusters = km.fit(X).predict(X)
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 8))
for k in range(K):
    plt.scatter(Z[clusters == k, 0], Z[clusters == k, 1], alpha=0.7)
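When comparing the from-scratch run with scikit-learn's, note that cluster labels are only defined up to a permutation (our "cluster 1" may be KMeans's "cluster 3"), so the two labelings should be compared with a label-invariant measure such as the adjusted Rand index. A toy sketch on made-up labels:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([2, 2, 0, 0, 1, 1])  # same partition as a, different label names
c = np.array([0, 1, 0, 1, 0, 1])  # a genuinely different partition

print(adjusted_rand_score(a, b))  # 1.0: the partitions are identical
print(adjusted_rand_score(a, c))  # well below 1: little agreement
```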